Preface


In recent years, random matrices have come to play a major role in computational mathematics, 
but most of the classical areas of random matrix theory remain the province of experts. Over 
the last decade, with the advent of matrix concentration inequalities, research has advanced to 
the point where we can conquer many (formerly) challenging problems with a page or two of 
arithmetic. 

My aim is to describe the most successful methods from this area along with some interesting 
examples that these techniques can illuminate. I hope that the results in these pages will inspire 
future work on applications of random matrices as well as refinements of the matrix concentration
inequalities discussed herein.

I have chosen to present a coherent body of results based on a generalization of the Laplace 
transform method for establishing scalar concentration inequalities. In the last two years, Lester 
Mackey and I, together with our coauthors, have developed an alternative approach to matrix 
concentration using exchangeable pairs and Markov chain couplings. With some regret, I have 
chosen to omit this theory because the ideas seem less accessible to a broad audience of
researchers. The interested reader will find pointers to these articles in the annotated bibliography.

The work described in these notes reflects the influence of many researchers. These include 
Rudolf Ahlswede, Rajendra Bhatia, Eric Carlen, Sourav Chatterjee, Edward Effros, Elliott Lieb, 
Roberto Imbuzeiro Oliveira, Denes Petz, Gilles Pisier, Mark Rudelson, Roman Vershynin, and 
Andreas Winter. I have also learned a great deal from other colleagues and friends along the way. 

I would like to thank some people who have helped me improve this work. Several readers 
informed me about errors in the initial version of this manuscript; these include Serg Bogdanov, 
Peter Forrester, Nikos Karampatziakis, and Guido Lagos. The anonymous reviewers tendered 
many useful suggestions, and they pointed out a number of errors. Sid Barman gave me feedback 
on the final revisions to the monograph. Last, I want to thank Leon Nijensohn for his continuing 
encouragement. 
CHAPTER 1



Introduction 


Random matrix theory has grown into a vital area of probability, and it has found applications 
in many other fields. To motivate the results in this monograph, we begin with an overview of 
the connections between random matrix theory and computational mathematics. We introduce 
the basic ideas underlying our approach, and we state one of our main results on the behavior 
of random matrices. As an application, we examine the properties of the sample covariance 
estimator, a random matrix that arises in statistics. Afterward, we summarize the other types of 
results that appear in these notes, and we assess the novelties in this presentation. 

1.1 Historical Origins 

Random matrix theory sprang from several different sources in the first half of the 20th century. 

Geometry of Numbers. Peter Forrester [For10, p. v] traces the field of random matrix theory to
work of Hurwitz, who defined the invariant integral over a Lie group. Specializing this
analysis to the orthogonal group, we can reinterpret this integral as the expectation of a 
function of a uniformly random orthogonal matrix. 

Multivariate Statistics. Another early example of a random matrix appeared in the work of John
Wishart [Wis28]. Wishart was studying the behavior of the sample covariance estimator for
the covariance matrix of a multivariate normal random vector. He showed that the estimator,
which is a random matrix, has the distribution that now bears his name. Statisticians
have often used random matrices as models for multivariate data [MKB79, Mui82].

Numerical Linear Algebra. In their remarkable work [vNG47, GvN51] on computational methods
for solving systems of linear equations, von Neumann and Goldstine considered a
random matrix model for the floating-point errors that arise from an LU decomposition.¹
They obtained a high-probability bound for the norm of the random matrix, which they
took as an estimate for the error the procedure might typically incur. Curiously, in subsequent
years, numerical linear algebraists became very suspicious of probabilistic techniques,
and only in recent years have randomized algorithms reappeared in this field. See
the surveys [Mah11, HMT11, Woo14] for more details and references.

¹ von Neumann and Goldstine invented and analyzed this algorithm before they had any digital computer on which
to implement it! See [Grc11] for a historical account.

Nuclear Physics. In the early 1950s, physicists had reached the limits of deterministic analytical
techniques for studying the energy spectra of heavy atoms undergoing slow nuclear
reactions. Eugene Wigner was the first researcher to surmise that a random matrix with
appropriate symmetries might serve as a suitable model for the Hamiltonian of the quantum
mechanical system that describes the reaction. The eigenvalues of this random matrix
model the possible energy levels of the system. See Mehta’s book [Meh04, §1.1] for an account
of all this.

In each area, the motivation was quite different and led to distinct sets of questions. Later, 
random matrices began to percolate into other fields such as graph theory (the Erdős-Rényi
model [ER60] for a random graph) and number theory (as a model for the spacing of zeros of 
the Riemann zeta function [Mon73]). 

1.2 The Modern Random Matrix 

By now, random matrices are ubiquitous. They arise throughout modern mathematics and 
statistics, as well as in many branches of science and engineering. Random matrices have several
different purposes that we may wish to distinguish. They can be used within randomized
computer algorithms; they serve as models for data and for physical phenomena; and they are 
subjects of mathematical inquiry. This section offers a taste of these applications. Note that the 
ideas and references here reflect the author’s interests, and they are far from comprehensive! 

1.2.1 Algorithmic Applications 

The striking mathematical properties of random matrices can be harnessed to develop algorithms
for solving many different problems.

Computing Matrix Approximations. Random matrices can be used to develop fast algorithms 
for computing a truncated singular-value decomposition. In this application, we multiply 
a large input matrix by a smaller random matrix to extract information about the dominant 
singular vectors of the input matrix. The seed of this idea appears in [FKV98, DFK+99].
The survey [HMT11] explains how to implement this method in practice, while the two
monographs [Mah11, Woo14] cover more theoretical aspects.

Sparsification. One way to accelerate spectral computations on large matrices is to replace the 
original matrix by a sparse proxy that has similar spectral properties. An elegant way to 
produce the sparse proxy is to zero out entries of the original matrix at random while 
rescaling the entries that remain. This approach was proposed in [AM01, AM07], and the 
papers [AKL13, KD14] contain recent innovations. Related ideas play an important role in 
Spielman and Teng’s work [ST04] on fast algorithms for solving linear systems. 

Subsampling of Data. In large-scale machine learning, one may need to subsample data randomly
to reduce the computational costs of fitting a model. For instance, we can combine
random sampling with the Nyström decomposition to obtain a randomized approximation
of a kernel matrix. This method was introduced by Williams & Seeger [WS01]. The
paper [DM05] provides the first theoretical analysis, and the survey [GM14] contains more
complete results.

Dimension Reduction. A basic template in the theory of algorithms invokes randomized projection
to reduce the dimension of a computational problem. Many types of dimension
reduction are based on properties of random matrices. The two papers [JL84, Bou85]
established the mathematical foundations of this approach. The earliest applications in
computer science appear in the work [LLR95]. Many contemporary variants depend on
ideas from [AC09] and [CW13].

Combinatorial Optimization. One approach to solving a computationally difficult optimization
problem is to relax (i.e., enlarge) the constraint set so the problem becomes tractable,
to solve the relaxed problem, and then to use a randomized procedure to map the solution 
back to the original constraint set [BTN01, §4.3]. This technique is called relaxation and 
rounding. For hard optimization problems involving a matrix variable, the analysis of the 
rounding procedure often involves ideas from random matrix theory [So09, NRV13]. 

Compressed Sensing. When acquiring data about an object with relatively few degrees of freedom
as compared with the ambient dimension, we may be able to sieve out the important
information from the object by taking a small number of random measurements, where
the number of measurements is comparable to the number of degrees of freedom [GGI+02,
CRT06, Don06]. This observation is now referred to as compressed sensing. Random matrices
play a central role in the design and analysis of measurement procedures. For example,
see [FR13, CRPW12, ALMT14, Tro14].
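
To make the matrix approximation idea above concrete, here is a minimal sketch of a randomized truncated singular-value decomposition in the spirit of [HMT11]. It multiplies the input matrix by a smaller Gaussian random matrix and then works inside the sampled range. The function name `randomized_low_rank`, the oversampling parameter, and the test matrix are illustrative assumptions; this is not a tuned implementation.

```python
import numpy as np

def randomized_low_rank(A, rank, oversample=5, seed=0):
    """Sketch of a randomized truncated SVD (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Multiply the large input matrix by a smaller Gaussian random matrix.
    Omega = rng.standard_normal((n_cols, rank + oversample))
    # Orthonormal basis for the sampled range of A.
    Q, _ = np.linalg.qr(A @ Omega)
    # Exact SVD of the small matrix Q* A, then lift the left factor back.
    U_small, s, Vh = np.linalg.svd(Q.conj().T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vh[:rank, :]

# Usage on a synthetic matrix whose rank is exactly 3 (an assumed test case).
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 40))
U, s, Vh = randomized_low_rank(A, rank=3)
rel_err = np.linalg.norm(A - (U * s) @ Vh, 2) / np.linalg.norm(A, 2)
print(f"relative spectral-norm error: {rel_err:.2e}")
```

For a matrix that is only approximately low rank, the error depends on the decay of the singular values, and the analysis of that error is where random matrix theory enters.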

1.2.2 Modeling 

Random matrices also appear as models for multivariate data or multivariate phenomena. By
studying the properties of these models, we may hope to understand the typical behavior of a
data-analysis algorithm or a physical system.


Sparse Approximation for Random Signals. Sparse approximation has become an important 
problem in statistics, signal processing, machine learning and other areas. One model for 
a “typical” sparse signal poses the assumption that the nonzero coefficients that generate 
the signal are chosen at random. When analyzing methods for identifying the sparse set of 
coefficients, we must study the behavior of a random column submatrix drawn from the 
model matrix [Tro08a, Tro08b] . 

Demixing of Structured Signals. In data analysis, it is common to encounter a mixture of two 
structured signals, and the goal is to extract the two signals using prior information about 
the structures. A common model for this problem assumes that the signals are randomly 
oriented with respect to each other, which means that it is usually possible to discriminate 
the underlying structures. Random orthogonal matrices arise in the analysis of estimation 
techniques for this problem [MT14, ALMT14, MT13]. 
Stochastic Block Model. One probabilistic framework for describing community structure in
a network assumes that each pair of individuals in the same community has a relationship
with high probability, while each pair of individuals drawn from different communities
has a relationship with lower probability. This is referred to as the stochastic block
model [HLL83]. It is quite common to analyze algorithms for extracting community structure
from data by positing that this model holds. See [ABH14] for a recent contribution, as
well as a summary of the extensive literature.

High-Dimensional Data Analysis. More generally, random models are pervasive in the analysis
of statistical estimation procedures for high-dimensional data. Random matrix theory
plays a key role in this field [MKB79, Mui82, Kol11, BvdG11].

Wireless Communication. Random matrices are commonly used as models for wireless channels.
See the book of Tulino and Verdú for more information [TV04].

In these examples, it is important to recognize that random models may not coincide very well
with reality, but they allow us to get a sense of what might be possible in some generic cases.

1.2.3 Theoretical Aspects 

Random matrices are frequently studied for their intrinsic mathematical interest. In some fields,
they provide examples of striking phenomena. In other areas, they furnish counterexamples to
“intuitive” conjectures. Here are a few disparate problems where random matrices play a role.

Combinatorics. An expander graph has the property that every small set of vertices has edges 
linking it to a large proportion of the vertices. The expansion property is closely related to 
the spectral behavior of the adjacency matrix of the graph. The easiest construction of an 
expander involves a random matrix argument [AS00, §9.2].

Numerical Analysis. For worst-case examples, the Gaussian elimination method for solving a 
linear system is not numerically stable. In practice, however, stability problems rarely 
arise. One explanation for this phenomenon is that, with high probability, a small random 
perturbation of any fixed matrix is well conditioned. As a consequence, it can be shown 
that Gaussian elimination is stable for most matrices [SST06] . 

High-Dimensional Geometry. Dvoretzky’s Theorem states that, when $N$ is large, the unit ball
of each $N$-dimensional Banach space has a slice of dimension $n \approx \log N$ that is close to a
Euclidean ball with dimension $n$. It turns out that a random slice of dimension $n$ realizes
this property [Mil71]. This result can be framed as a statement about spectral properties
of a random matrix [Gor85].

Quantum Information Theory. Random matrices appear as counterexamples for a number of 
conjectures in quantum information theory. Here is one instance. In classical information 
theory, the total amount of information that we can transmit through a pair of channels 
equals the sum of the information we can send through each channel separately. It was 
conjectured that the same property holds for quantum channels. In fact, a pair of quantum
channels can have strictly larger capacity than the sum of the capacities of the two individual
channels. This result depends on a random matrix construction [Has09]. See [HW08] for related work.
1.3 Random Matrices for the People

Historically, random matrix theory has been regarded as a very challenging field. Even now,
many well-established methods are only comprehensible to researchers with significant experience,
and it may take months of intensive effort to prove new results. There are a small number
of classes of random matrices that have been studied so completely that we know almost everything
about them. Yet, moving beyond this terra firma, one quickly encounters examples where
classical methods are brittle.

We hope to democratize random matrix theory. These notes describe tools that deliver useful
information about a wide range of random matrices. In many cases, a modest amount of
straightforward arithmetic leads to strong results. The methods here should be accessible to
computational scientists working in a variety of fields. Indeed, the techniques in this work have
already found a large number of applications.

1.4 Basic Questions in Random Matrix Theory 

Random matrices merit special attention because they have spectral properties that are quite 
different from familiar deterministic matrices. Here are some of the questions we might want to 
investigate. 

• What is the expectation of the maximum eigenvalue of a random Hermitian matrix? What 
about the minimum eigenvalue? 

• How is the maximum eigenvalue of a random Hermitian matrix distributed? What is the
probability that it takes values substantially different from its mean? What about the minimum
eigenvalue?

• What is the expected spectral norm of a random matrix? What is the probability that the 
norm takes a value substantially different from its mean? 

• What about the other eigenvalues or singular values? Can we say something about the 
“typical” spectrum of a random matrix? 

• Can we say anything about the eigenvectors or singular vectors? For instance, is each one 
distributed almost uniformly on the sphere? 

• We can also ask questions about the operator norm of a random matrix acting as a map between
two normed linear spaces. In this case, the geometry of the domain and codomain
plays a role.

In this work, we focus on the first three questions above. We study the expectation of the extreme 
eigenvalues of a random Hermitian matrix, and we attempt to provide bounds on the probability 
that they take an unusual value. As an application of these results, we can control the expected 
spectral norm of a general matrix and bound the probability of a large deviation. These are the 
most relevant problems in many (but not all!) applications. The remaining questions are also 
important, but we will not touch on them here. We recommend the book [Tao12] for a friendly
introduction to other branches of random matrix theory. 
1.5 Random Matrices as Independent Sums

Our approach to random matrices depends on a fundamental principle: 

In applications, it is common that a random matrix can be expressed as a sum of 
independent random matrices. 

The examples that appear in these notes should provide ample evidence for this claim. For now, 
let us describe a specific problem that will serve as an illustration throughout the Introduction. 
We hope this example is complicated enough to be interesting but simple enough to elucidate 
the main points. 

1.5.1 Example: The Sample Covariance Estimator 

Let $x = (X_1, \dots, X_p)$ be a complex random vector with zero mean: $\mathbb{E}\, x = 0$. The covariance matrix
$A$ of the random vector $x$ is the positive-semidefinite matrix

$$A = \mathbb{E}(x x^*) = \sum_{j,k=1}^{p} \mathbb{E}(X_j X_k^*) \, E_{jk}. \tag{1.5.1}$$

The star $^*$ refers to the conjugate transpose operation, and the standard basis matrix $E_{jk}$ has
a one in the $(j, k)$ position and zeros elsewhere. In other words, the $(j, k)$ entry of the
covariance matrix $A$ records the covariance between the $j$th and $k$th entry of the vector $x$.

One basic problem in statistical practice is to estimate the covariance matrix from data.
Imagine that we have access to $n$ independent samples $x_1, \dots, x_n$, each distributed the same way
as $x$. The sample covariance estimator $Y$ is the random matrix

$$Y = \frac{1}{n} \sum_{k=1}^{n} x_k x_k^*. \tag{1.5.2}$$

The random matrix $Y$ is an unbiased estimator² for the covariance matrix: $\mathbb{E}\, Y = A$. Observe
that the sample covariance estimator $Y$ fits neatly into our paradigm:

The sample covariance estimator can be expressed as a sum of independent random
matrices.

This is precisely the type of decomposition that allows us to apply the tools in these notes. 
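
The decomposition can be checked numerically. The sketch below builds the sample covariance estimator as an explicit sum of independent rank-one matrices and compares it with the usual matrix formula. The distribution (independent entries uniform on $[-1, 1]$, so that $A = \tfrac{1}{3} I$) and the sizes `p` and `n` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 2000

# Hypothetical test distribution: independent entries uniform on [-1, 1],
# so E x = 0 and the covariance matrix is A = (1/3) I.
X = rng.uniform(-1.0, 1.0, size=(n, p))   # row k holds the sample x_k
A = np.eye(p) / 3.0

# Each summand (1/n) x_k x_k^* is an independent rank-one random matrix.
summands = [np.outer(x, x.conj()) / n for x in X]
Y = sum(summands)

# The sum agrees with the usual matrix formula for the estimator.
Y_direct = X.T @ X.conj() / n
err = np.linalg.norm(Y - A, 2)
print(f"max entrywise gap between the two formulas: {np.abs(Y - Y_direct).max():.2e}")
print(f"spectral-norm error ||Y - A||: {err:.4f}")
```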

1.6 Exponential Concentration Inequalities for Matrices 

An important challenge in probability theory is to study the probability that a real random variable
$Z$ takes a value substantially different from its mean. That is, we seek a bound of the form

$$\mathbb{P}\{ |Z - \mathbb{E}\, Z| \ge t \} \le \ \text{???} \tag{1.6.1}$$

for a positive parameter $t$. When $Z$ is expressed as a sum of independent random variables, the
literature contains many tools for addressing this problem. See [BLM13] for an overview.

² The formula (1.5.2) supposes that the random vector $x$ is known to have zero mean. Otherwise, we have to make
an adjustment to incorporate an estimate for the sample mean.

For a random matrix $Z$, a variant of (1.6.1) is the question of whether $Z$ deviates substantially
from its mean value. We might frame this question as

$$\mathbb{P}\{ \|Z - \mathbb{E}\, Z\| \ge t \} \le \ \text{???}. \tag{1.6.2}$$

Here and elsewhere, $\|\cdot\|$ denotes the spectral norm of a matrix. As noted, it is frequently possible
to decompose $Z$ as a sum of independent random matrices. We might even dream that
established methods for studying the scalar concentration problem (1.6.1) extend to (1.6.2).

1.6.1 The Bernstein Inequality 

To explain what kind of results we have in mind, let us return to the scalar problem (1.6.1). First,
to simplify formulas, we assume that the real random variable $Z$ has zero mean: $\mathbb{E}\, Z = 0$. If not,
we can simply center the random variable by subtracting its mean. Second, and more restrictively,
we suppose that $Z$ can be expressed as a sum of independent, real random variables.

To control $Z$, we rely on two types of information: global properties of the sum (such as its
mean and variance) and local properties of the summands (such as their maximum fluctuation).
These pieces of data are usually easy to obtain. Together, they determine how well $Z$ concentrates
around zero, its mean value.

Theorem 1.6.1 (Bernstein Inequality). Let $S_1, \dots, S_n$ be independent, centered, real random variables,
and assume that each one is uniformly bounded:

$$\mathbb{E}\, S_k = 0 \quad \text{and} \quad |S_k| \le L \quad \text{for each } k = 1, \dots, n.$$

Introduce the sum $Z = \sum_{k=1}^{n} S_k$, and let $v(Z)$ denote the variance of the sum:

$$v(Z) = \mathbb{E}\, Z^2 = \sum_{k=1}^{n} \mathbb{E}\, S_k^2.$$

Then

$$\mathbb{P}\{ |Z| \ge t \} \le 2 \exp\left( \frac{-t^2/2}{v(Z) + Lt/3} \right) \quad \text{for all } t \ge 0.$$

See [BLM13, §2.8] for a proof of this result. We refer to Theorem 1.6.1 as an exponential concentration
inequality because it yields exponentially decaying bounds on the probability that $Z$
deviates substantially from its mean.
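
To see how the scalar Bernstein inequality behaves on a concrete sum, the following sketch compares the right-hand side of Theorem 1.6.1 with an empirical tail frequency. The choice of summands (independent uniform variables on $[-1, 1]$, so that $L = 1$ and $\mathbb{E}\, S_k^2 = 1/3$), the number of trials, and the helper name `bernstein_tail` are assumptions made for illustration.

```python
import numpy as np

def bernstein_tail(t, v, L):
    """Right-hand side of the scalar Bernstein inequality, Theorem 1.6.1."""
    return 2.0 * np.exp(-(t**2 / 2.0) / (v + L * t / 3.0))

rng = np.random.default_rng(0)
n, trials = 100, 20000

# Illustrative summands: independent uniform variables on [-1, 1],
# so E S_k = 0, |S_k| <= L = 1, and E S_k^2 = 1/3.
S = rng.uniform(-1.0, 1.0, size=(trials, n))
Z = S.sum(axis=1)
L, v = 1.0, n / 3.0

for t in (5.0, 10.0, 15.0):
    empirical = np.mean(np.abs(Z) >= t)
    print(f"t = {t:5.1f}   empirical tail = {empirical:.4f}   "
          f"Bernstein bound = {bernstein_tail(t, v, L):.4f}")
```

For small $t$ the variance term $v(Z)$ dominates the denominator and the bound has a Gaussian shape; for large $t$ the term $Lt/3$ takes over and the decay is only exponential in $t$.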

1.6.2 The Matrix Bernstein Inequality 

What is truly astonishing is that the scalar Bernstein inequality, Theorem 1.6.1, lifts directly to 
matrices. Let us emphasize this remarkable fact: 

There are exponential concentration inequalities for the spectral norm of a sum 
of independent random matrices. 
As a consequence, once we decompose a random matrix as an independent sum, we can harness
global properties (such as the mean and the variance) and local properties (such as a uniform 
bound on the summands) to obtain detailed information about the norm of the sum. As in the 
scalar case, it is usually easy to acquire the input data for the inequality. But the output of the 
inequality is highly nontrivial. 

To illustrate these claims, we will state one of the major results from this monograph. This
theorem is a matrix extension of Bernstein’s inequality that was developed independently in the
two papers [Oli10a, Tro11c]. After presenting the result, we give some more details about its interpretation.
In the next section, we apply this result to study the covariance estimation problem.

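Theorem 1.6.2 (Matrix Bernstein). Let $S_1, \dots, S_n$ be independent, centered random matrices with
common dimension $d_1 \times d_2$, and assume that each one is uniformly bounded:

$$\mathbb{E}\, S_k = 0 \quad \text{and} \quad \|S_k\| \le L \quad \text{for each } k = 1, \dots, n.$$

Introduce the sum

$$Z = \sum_{k=1}^{n} S_k, \tag{1.6.3}$$

and let $v(Z)$ denote the matrix variance statistic of the sum:

$$v(Z) = \max\left\{ \| \mathbb{E}(Z Z^*) \|, \ \| \mathbb{E}(Z^* Z) \| \right\}
= \max\left\{ \left\| \sum_{k=1}^{n} \mathbb{E}(S_k S_k^*) \right\|, \ \left\| \sum_{k=1}^{n} \mathbb{E}(S_k^* S_k) \right\| \right\}. \tag{1.6.4}$$

Then

$$\mathbb{P}\{ \|Z\| \ge t \} \le (d_1 + d_2) \cdot \exp\left( \frac{-t^2/2}{v(Z) + Lt/3} \right) \quad \text{for all } t \ge 0. \tag{1.6.5}$$

Furthermore,

$$\mathbb{E}\, \|Z\| \le \sqrt{2 v(Z) \log(d_1 + d_2)} + \frac{1}{3} L \log(d_1 + d_2). \tag{1.6.6}$$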


The proof of this result appears in Chapter 6. 
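
As a quick illustration of how the inputs $L$ and $v(Z)$ enter Theorem 1.6.2, the sketch below evaluates the bounds (1.6.5) and (1.6.6) for a sum $Z = \sum_k \varepsilon_k A_k$ of fixed matrices $A_k$ modulated by independent random signs $\varepsilon_k$. In that case $S_k = \varepsilon_k A_k$ is centered, $\|S_k\| = \|A_k\|$, and $\mathbb{E}(S_k S_k^*) = A_k A_k^*$. The coefficient matrices, the sizes, and the function names are assumptions for this example only.

```python
import numpy as np

def matrix_bernstein_tail(t, v, L, d1, d2):
    """Right-hand side of the tail bound (1.6.5)."""
    return (d1 + d2) * np.exp(-(t**2 / 2.0) / (v + L * t / 3.0))

def matrix_bernstein_expectation(v, L, d1, d2):
    """Right-hand side of the expectation bound (1.6.6)."""
    log_dim = np.log(d1 + d2)
    return np.sqrt(2.0 * v * log_dim) + L * log_dim / 3.0

rng = np.random.default_rng(0)
n, d1, d2 = 50, 8, 6

# Fixed (real) coefficient matrices A_k, drawn once and then held constant.
A = rng.standard_normal((n, d1, d2)) / np.sqrt(n)

# Inputs of the theorem for the summands S_k = eps_k * A_k.
L = max(np.linalg.norm(Ak, 2) for Ak in A)
v = max(np.linalg.norm(sum(Ak @ Ak.T for Ak in A), 2),
        np.linalg.norm(sum(Ak.T @ Ak for Ak in A), 2))

# Monte Carlo estimate of E||Z|| over independent sign patterns.
signs = rng.choice([-1.0, 1.0], size=(2000, n))
Z = np.einsum("tk,kij->tij", signs, A)
norms = np.linalg.norm(Z, ord=2, axis=(1, 2))
mean_norm = norms.mean()
bound = matrix_bernstein_expectation(v, L, d1, d2)
print(f"empirical E||Z|| = {mean_norm:.3f}   bound (1.6.6) = {bound:.3f}")
```

Setting `d1 = d2 = 1` in `matrix_bernstein_tail` recovers the factor two of the scalar inequality.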

To appreciate what Theorem 1.6.2 means, it is valuable to make a direct comparison with the 
scalar version, Theorem 1.6.1. In both cases, we express the object of interest as an independent 
sum, and we instate a uniform bound on the summands. There are three salient changes: 


• The variance $v(Z)$ in the result for matrices can be interpreted as the magnitude of the
expected squared deviation of $Z$ from its mean. The formula reflects the fact that a general
matrix $B$ has two different squares $BB^*$ and $B^*B$. For an Hermitian matrix, the two
squares coincide.

• The tail bound has a dimensional factor $d_1 + d_2$ that depends on the size of the matrix. This
factor reduces to two in the scalar setting. In the matrix case, it limits the range of $t$ where
the tail bound is informative.

• We have included a bound for $\mathbb{E}\, \|Z\|$. This estimate is not particularly interesting in the
scalar setting, but it is usually quite challenging to prove results of this type for matrices.
In fact, the expectation bound is often more useful than the tail bound.
The latter point deserves amplification:

The expectation bound (1.6.6) is the most important aspect of the matrix Bernstein
inequality.

For further discussion of this result, turn to Chapter 6. Chapters 4 and 7 contain related results 
and interpretations. 

1.6.3 Example: The Sample Covariance Estimator 

We will apply the matrix Bernstein inequality, Theorem 1.6.2, to measure how well the sample
covariance estimator approximates the true covariance matrix. As before, let $x$ be a zero-mean
random vector with dimension $p$. Introduce the $p \times p$ covariance matrix $A = \mathbb{E}(x x^*)$. Suppose we
have $n$ independent samples $x_1, \dots, x_n$ with the same distribution as $x$. Form the $p \times p$ sample
covariance estimator

$$Y = \frac{1}{n} \sum_{k=1}^{n} x_k x_k^*.$$

Our goal is to study how the spectral-norm distance $\|Y - A\|$ between the sample covariance and
the true covariance depends on the number $n$ of samples.

For simplicity, we will perform the analysis under the extra assumption that the $\ell_2$ norm of
the random vector is bounded: $\|x\|^2 \le B$. This hypothesis can be relaxed if we apply a variant of
the matrix Bernstein inequality that reflects the typical magnitude of a summand $S_k$. One such
variant appears in the formula (6.1.6).

We are in a situation where it is quite easy to see how the matrix Bernstein inequality applies. 
Define the random deviation $Z$ of the estimator $Y$ from the true covariance matrix $A$:

$$Z = Y - A = \sum_{k=1}^{n} S_k \quad \text{where} \quad S_k = \frac{1}{n} (x_k x_k^* - A) \quad \text{for each index } k.$$

The random matrices $S_k$ are independent, identically distributed, and centered. To apply
Theorem 1.6.2, we need to find a uniform bound $L$ for the summands, and we need to control the
matrix variance statistic $v(Z)$.

First, let us develop a uniform bound on the spectral norm of each summand. We may calculate
that

$$\|S_k\| = \frac{1}{n} \|x_k x_k^* - A\| \le \frac{1}{n} \left( \|x_k x_k^*\| + \|A\| \right) \le \frac{2B}{n}.$$

The first inequality is the triangle inequality. The second follows from the assumption that $x$ is
bounded and the observation that

$$\|A\| = \|\mathbb{E}(x x^*)\| \le \mathbb{E}\, \|x x^*\| = \mathbb{E}\, \|x\|^2 \le B.$$

This expression depends on Jensen’s inequality and the hypothesis that $x$ is bounded.

Second, we need to bound the matrix variance statistic $v(Z)$ defined in (1.6.4). The matrix $Z$
is Hermitian, so the two squares in this formula coincide with each other:

$$v(Z) = \| \mathbb{E}\, Z^2 \| = \left\| \sum_{k=1}^{n} \mathbb{E}\, S_k^2 \right\|.$$

We need to determine the variance of each summand. By direct calculation,

$$\begin{aligned}
\mathbb{E}\, S_k^2 &= \frac{1}{n^2} \, \mathbb{E}\, (x_k x_k^* - A)^2 \\
&= \frac{1}{n^2} \, \mathbb{E} \left[ \|x_k\|^2 \cdot x_k x_k^* - (x_k x_k^*) A - A (x_k x_k^*) + A^2 \right] \\
&\preccurlyeq \frac{1}{n^2} \left[ B \cdot \mathbb{E}(x_k x_k^*) - A^2 - A^2 + A^2 \right] \\
&\preccurlyeq \frac{B}{n^2} \cdot A.
\end{aligned}$$

The expression $H \preccurlyeq T$ means that $T - H$ is positive semidefinite. We used the norm bound for
the random vector $x$ and the fact that expectation preserves the semidefinite order. In the last
step, we dropped the negative-semidefinite term $-A^2$. Summing this relation over $k$, we reach

$$0 \preccurlyeq \sum_{k=1}^{n} \mathbb{E}\, S_k^2 \preccurlyeq \frac{B}{n} \cdot A.$$

The matrix is positive semidefinite because it is a sum of squares of Hermitian matrices. Extract
the spectral norm to arrive at

$$v(Z) = \left\| \sum_{k=1}^{n} \mathbb{E}\, S_k^2 \right\| \le \frac{B \|A\|}{n}.$$

We have now collected the information we need to analyze the sample covariance estimator. 

We can invoke the estimate (1.6.6) from the matrix Bernstein inequality, Theorem 1.6.2, with
the uniform bound $L = 2B/n$ and the variance bound $v(Z) \le B \|A\| / n$. We attain

$$\mathbb{E}\, \|Y - A\| = \mathbb{E}\, \|Z\| \le \sqrt{\frac{2 B \|A\| \log(2p)}{n}} + \frac{2 B \log(2p)}{3n}.$$


In other words, the error in approximating the true covariance matrix is not too large when
we have a sufficient number of samples. If we wish to obtain a relative error on the order of $\varepsilon$, we
may take

$$n \ge \frac{2 B \log(2p)}{\varepsilon^2 \|A\|}.$$

This selection yields

$$\mathbb{E}\, \|Y - A\| \le (\varepsilon + \varepsilon^2) \cdot \|A\|.$$

It is often the case that $B = \mathrm{Const} \cdot p$, so we discover that $n = \mathrm{Const} \cdot \varepsilon^{-2} p \log p$ samples are sufficient
for the sample covariance estimator to provide a relatively accurate estimate of the true
covariance matrix $A$. This bound is qualitatively sharp for worst-case distributions.

The analysis in this section applies to many other examples. We encapsulate the argument 
in Corollary 6.2.1, which we use to study several more problems. 
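
The arithmetic of this section is easy to script. The sketch below evaluates the expectation bound with $L = 2B/n$ and $v(Z) \le B \|A\| / n$, computes the sample size suggested for a target relative error, and compares the bound with a Monte Carlo estimate of the error. The test distribution (independent entries uniform on $[-1, 1]$, for which $\|x\|^2 \le B = p$ and $A = \tfrac{1}{3} I$), the parameter values, and the function names are assumptions for illustration.

```python
import numpy as np

def covariance_error_bound(n, p, B, norm_A):
    """Bound on E||Y - A|| from (1.6.6) with L = 2B/n and v(Z) <= B||A||/n."""
    log_dim = np.log(2 * p)
    return np.sqrt(2.0 * B * norm_A * log_dim / n) + 2.0 * B * log_dim / (3.0 * n)

def samples_for_relative_error(eps, p, B, norm_A):
    """Smallest integer n with n >= 2 B log(2p) / (eps^2 ||A||)."""
    return int(np.ceil(2.0 * B * np.log(2 * p) / (eps**2 * norm_A)))

# Hypothetical example: entries uniform on [-1, 1], so ||x||^2 <= B = p
# and the covariance matrix A = (1/3) I has spectral norm 1/3.
p, eps = 10, 0.25
B, norm_A = float(p), 1.0 / 3.0
n = samples_for_relative_error(eps, p, B, norm_A)
bound = covariance_error_bound(n, p, B, norm_A)

# Monte Carlo estimate of E||Y - A|| at this sample size.
rng = np.random.default_rng(0)
errors = []
for _ in range(50):
    X = rng.uniform(-1.0, 1.0, size=(n, p))
    Y = X.T @ X / n
    errors.append(np.linalg.norm(Y - np.eye(p) / 3.0, 2))
mean_error = float(np.mean(errors))

print(f"n = {n}, bound = {bound:.4f}, target = {(eps + eps**2) * norm_A:.4f}, "
      f"empirical error = {mean_error:.4f}")
```

For a benign distribution like this one, the empirical error typically sits well below the bound, which is consistent with the remark that the estimate is sharp only for worst-case distributions.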


1.6.4 History of this Example 

Covariance estimation may be the earliest application of matrix concentration tools in random
matrix theory. Rudelson [Rud99], building on a suggestion of Pisier, showed how to use the noncommutative
Khintchine inequality [LP86, LPP91, Buc01, Buc05] to obtain essentially optimal
bounds on the sample covariance estimator of a bounded random vector. The tutorial [Ver12] of
Roman Vershynin offers an overview of this problem as well as many results and references. The
analysis of the sample covariance matrix here is adapted from the technical report [GT14]. It leads to a
result similar to the one Rudelson obtained in [Rud99].
1.6.5 Optimality of the Matrix Bernstein Inequality

Theorem 1.6.2 can be sharpened very little because it applies to every random matrix Z of the 
form (1.6.3). Let us say a few words about optimality now, postponing the details to §6.1.2. 

Suppose that $Z$ is a random matrix of the form (1.6.3). To make the comparison simpler, we
also insist that each summand $S_k$ is a symmetric random variable; that is, $S_k$ and $-S_k$ have the
same distribution for each index $k$. Introduce the quantity

$$L_\star^2 = \mathbb{E} \max\nolimits_k \|S_k\|^2.$$

In §6.1.2, we will argue that these assumptions imply

$$\begin{aligned}
\mathrm{const} \cdot \left[ v(Z) + L_\star^2 \right] &\le \mathbb{E}\, \|Z\|^2 \\
&\le \mathrm{Const} \cdot \left[ v(Z) \log(d_1 + d_2) + L_\star^2 \log^2(d_1 + d_2) \right].
\end{aligned} \tag{1.6.7}$$

In other words, the scale of $\mathbb{E}\, \|Z\|^2$ must depend on the matrix variance statistic $v(Z)$ and the
average upper bound $L_\star^2$ for the summands. The quantity $L = \sup \|S_k\|$ that appears in the matrix
Bernstein inequality always exceeds $L_\star$, sometimes by a large margin, but they capture the same
type of information.

The significant difference between the lower and upper bound in (1.6.7) comes from the dimensional
factor $\log(d_1 + d_2)$. There are random matrices $Z$ for which the lower bound gives a
more accurate reflection of $\mathbb{E}\, \|Z\|^2$, but there are also many random matrices where the upper
bound describes the behavior correctly. At present, there is no method known for distinguishing
between these two extremes under the model (1.6.3) for the random matrix.

The tail bound (1.6.5) provides a useful tool in practice, but it is not necessarily the best way
to collect information about large deviation probabilities. To obtain more precise results, we recommend
using the expectation bound (1.6.6) to control $\mathbb{E}\, \|Z\|$ and then applying scalar concentration
inequalities to estimate $\mathbb{P}\{ \|Z\| \ge \mathbb{E}\, \|Z\| + t \}$. The book [BLM13] offers a good treatment of
the methods that are available for establishing scalar concentration.

1.7 The Arsenal of Results 

The Bernstein inequality is probably the most familiar exponential tail bound for a sum of independent
random variables, but there are many more. It turns out that essentially all of these
scalar results admit extensions that hold for random matrices. In fact, many of the established 
techniques for scalar concentration have analogs in the matrix setting. 

1.7.1 What’s Here... 

This monograph focuses on a few key exponential concentration inequalities for a sum of independent
random matrices, and it describes some specific applications of these results.

Matrix Gaussian Series. A matrix Gaussian series is a random matrix that can be expressed as
a sum of fixed matrices, each weighted by an independent standard normal random variable.
This formulation includes a surprising number of examples. The most important
are undoubtedly Wigner matrices and rectangular Gaussian matrices. Another interesting
case is a Toeplitz matrix with Gaussian entries. The analysis of matrix Gaussian series
appears in Chapter 4.
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Matrix Rademacher Series. A matrix Rademacher series is a random matrix that can be writ¬ 
ten as a sum of fixed matrices, each weighted by an independent Rademacher random 
variable. 3 This construction includes things like random sign matrices, as well as a fixed 
matrix whose entries are modulated by random signs. There are also interesting examples 
that arise in combinatorial optimization. We treat these problems in Chapter 4. 

Matrix Chernoff Bounds. The matrix Chernoff bounds apply to a random matrix that can be
decomposed as a sum of independent, random positive-semidefinite matrices whose maximum
eigenvalues are subject to a uniform bound. These results allow us to obtain information
about the norm of a random submatrix drawn from a fixed matrix. They are also
appropriate for studying the Laplacian matrix of a random graph. See Chapter 5.

Matrix Bernstein Bounds. The matrix Bernstein inequality concerns a random matrix that can
be expressed as a sum of independent, centered random matrices that admit a uniform
spectral-norm bound. This result has many applications, including the analysis of randomized
algorithms for matrix sparsification and matrix multiplication. It can also be used
to study the random features paradigm for approximating a kernel matrix. Chapter 6
contains this material.

Intrinsic Dimension Bounds. Some matrix concentration inequalities can be improved when
the random matrix has limited spectral content in most dimensions. In this situation, we
may be able to obtain bounds that do not depend on the ambient dimension. See Chapter 7
for details.

We have chosen to present these results because they are illustrative, and they have already 
found concrete applications. 
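
As a rough illustration of the first two families in the list above, the following sketch assembles a matrix Gaussian series and a matrix Rademacher series from the same fixed coefficient matrices. It is written in Python with NumPy, which is our choice for illustration and not part of the original text. The coefficient matrices, the dimension `d`, and the helper name `matrix_series` are assumptions made only for this example.

```python
import numpy as np

def matrix_series(coeffs, matrices):
    """Return sum_k coeffs[k] * matrices[k] for fixed matrices of equal shape."""
    return sum(c * B for c, B in zip(coeffs, matrices))

rng = np.random.default_rng(0)
d = 4  # hypothetical dimension, chosen only for the example

# Fixed symmetric coefficient matrices: one for each index pair j <= k,
# with ones in positions (j, k) and (k, j) and zeros elsewhere.
fixed = []
for j in range(d):
    for k in range(j, d):
        B = np.zeros((d, d))
        B[j, k] = 1.0
        B[k, j] = 1.0
        fixed.append(B)

gauss = rng.standard_normal(len(fixed))            # independent standard normals
signs = rng.choice([-1.0, 1.0], size=len(fixed))   # independent Rademacher signs

Z_gauss = matrix_series(gauss, fixed)   # symmetric matrix with Gaussian entries
Z_rad = matrix_series(signs, fixed)     # symmetric random sign matrix
```

With this particular choice of coefficient matrices, each entry on or above the diagonal receives exactly one random weight, so `Z_gauss` is a symmetric Gaussian matrix and `Z_rad` is a symmetric matrix of random signs. Other choices of fixed matrices lead to the rectangular and Toeplitz examples mentioned above.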

1.7.2 What’s Not Here... 

The program of extending scalar concentration results to the matrix setting has been quite
fruitful, and there are many useful results beyond the ones that we detail. Let us mention some of the
other tools that are available. For further information, see the annotated bibliography. 

First, there are additional exponential concentration inequalities for a sum of independent 
random matrices. All of the following results can be established within the framework of this 
monograph. 

• Matrix Hoeffding. This result concerns a sum of independent random matrices whose
squares are subject to semidefinite upper bounds [Tro11c, §7].

• Matrix Bennett. This estimate sharpens the tail bound from the matrix Bernstein inequality
[Tro11c, §6].

• Matrix Bernstein, Unbounded Case. The matrix Bernstein inequality extends to the case
where the moments of the summands grow at a controlled rate. See [Tro11c, §6] or [Kol11].

• Matrix Bernstein, Nonnegative Summands. The lower tail of the Bernstein inequality can
be improved when the summands are positive semidefinite [Mau03]; this result extends to
the matrix setting. By a different argument, the dimensional factor can be removed from
this bound for a class of interesting examples [Oli13, Thm. 3.1].

³ A Rademacher random variable takes the two values ±1 with equal probability.

The approach in this monograph can be adapted to obtain exponential concentration for
matrix-valued martingales. Here are a few results from this category:

• Matrix Azuma. This is the martingale version of the matrix Hoeffding bound [Tro11c, §7].

• Matrix Bounded Differences. The matrix Azuma inequality gives bounds for the spectral
norm of a matrix-valued function of independent random variables [Tro11c, §7].

• Matrix Freedman. This result can be viewed as the martingale extension of the matrix
Bernstein inequality [Oli10a, Tro11a].

The technical report [Tro11b] explains how to extend other bounds for a sum of independent
random matrices to the martingale setting. 

Polynomial moment inequalities provide bounds for the expected trace of a power of a random
matrix. Moment inequalities for a sum of independent random matrices can provide useful
information when the summands have heavy tails or else a uniform bound does not reflect the
typical size of the summands.

• Matrix Khintchine. The matrix Khintchine inequality is the polynomial version of the
exponential bounds for matrix Gaussian series and matrix Rademacher series. This result is
presented in (4.7.1). See the papers [LP86, Buc01, Buc05] or [MJC+14, Cor. 7.3] for proofs.

• Matrix Moment Inequalities. The matrix Chernoff inequality admits a polynomial variant;
the simplest form appears in (5.1.9). The matrix Bernstein inequality also has a polynomial
variant, stated in (6.1.6). These bounds are drawn from [CGT12a, App.].

The methods that lead to polynomial moment inequalities differ substantially from the
techniques in this monograph, so we cannot include the proofs here. The annotated bibliography
includes references to the large literature on moment inequalities for random matrices. 

Recently, Lester Mackey and the author, in collaboration with Daniel Paulin and several other
researchers [MJC+14, PMT14], have developed another framework for establishing matrix
concentration. This approach extends a scalar argument, introduced by Chatterjee [Cha05, Cha07],
that depends on exchangeable pairs and Markov chain couplings. The method of exchangeable 
pairs delivers both exponential concentration inequalities and polynomial moment inequalities 
for random matrices, and it can reproduce many of the bounds mentioned above. It also leads 
to new results: 

• Polynomial Efron-Stein Inequality for Matrices. This bound is a matrix version of the
polynomial Efron-Stein inequality [BBLM05, Thm. 1]. It controls the polynomial moments
of a centered random matrix that is a function of independent random variables [PMT14,
Thm. 4.2].

• Exponential Efron-Stein Inequality for Matrices. This bound is the matrix extension of
the exponential Efron-Stein inequality [BLM03, Thm. 1]. It leads to exponential
concentration inequalities for a centered random matrix constructed from independent random
variables [PMT14, Thm. 4.3].

Another significant advantage is that the method of exchangeable pairs can sometimes handle
random matrices built from dependent random variables. Although the simplest version of the
exchangeable pairs argument is more elementary than the approach in this monograph, it takes
a lot of effort to establish the more useful inequalities. With some regret, we have chosen not to
include this material because the method and results are accessible to a narrower audience.

Finally, we remark that the modified logarithmic Sobolev inequalities of [BLM03, BBLM05] 
also extend to the matrix setting [CT14]. Unfortunately, the matrix variants do not seem to be as
useful as the scalar results. 

1.8 About This Monograph 

This monograph is intended for graduate students and researchers in computational mathematics
who want to learn some modern techniques for analyzing random matrices. The preparation
required is minimal. We assume familiarity with calculus, applied linear algebra, the basic
theory of normed spaces, and classical probability theory up through the elementary concentration
inequalities (such as Markov and Bernstein). Beyond the basics, which can be gleaned from any
good textbook, we include all the required background in Chapter 2.

The material here is based primarily on the paper “User-Friendly Tail Bounds for Sums of 
Random Matrices” by the present author [Tro11c]. There are several significant revisions to this
earlier work: 

Examples and Applications. Many of the papers on matrix concentration give limited information
about how the results can be used to solve problems of interest. A major part of these
notes consists of worked examples and applications that indicate how matrix concentration
inequalities apply to practical questions.

Expectation Bounds. This work collects bounds for the expected value of the spectral norm of 
a random matrix and bounds for the expectation of the smallest and largest eigenvalues of 
a random symmetric matrix. Some of these useful results have appeared piecemeal in the 
literature [CGT12a, MJC+14], but they have not been included in a unified presentation.

Optimality. We explain why each matrix concentration inequality is (nearly) optimal. This
presentation includes examples to show that each term in each bound is necessary to describe
some particular phenomenon. 

Intrinsic Dimension Bounds. Over the last few years, there have been some refinements to the 
basic matrix concentration bounds that improve the dependence on dimension [HKZ12,
Min11]. We describe a new framework that allows us to prove these results with ease.

Lieb’s Theorem. The matrix concentration inequalities in this monograph depend on a deep 
theorem [Lie73, Thm. 6] from matrix analysis due to Elliott Lieb. We provide a complete 
proof of this result, along with all the background required to understand the argument. 

Annotated Bibliography. We have included a list of the major works on matrix concentration, 
including a short summary of the main contributions of these papers. We hope this catalog 
will be a valuable guide for further reading. 

The organization of the notes is straightforward. Chapter 2 contains background material
that is needed for the analysis. Chapter 3 describes the framework for developing exponential
concentration inequalities for matrices. Chapter 4 presents the first set of results and examples,
concerning matrix Gaussian and Rademacher series. Chapter 5 introduces the matrix Chernoff
bounds and their applications, and Chapter 6 expands on our discussion of the matrix Bernstein
inequality. Chapter 7 shows how to sharpen some of the results so that they depend on an
intrinsic dimension parameter. Chapter 8 contains the proof of Lieb’s theorem. We conclude with
resources on matrix concentration and a bibliography.

To make the presentation smoother, we have not followed all of the conventions for scholarly 
articles in journals. In particular, almost all the citations appear in the notes at the end of each 
chapter. Our aim has been to explain the ideas as clearly as possible, rather than to interrupt the 
narrative with an elaborate genealogy of results. 




Matrix Functions & Probability with Matrices


We begin the main development with a short overview of the background material that is
required to understand the proofs and, to a lesser extent, the statements of matrix concentration
inequalities. We have been careful to provide cross-references to these foundational results, so 
most readers will be able to proceed directly to the main theoretical development in Chapter 3 
or the discussion of specific random matrix inequalities in Chapters 4, 5, and 6. 


Overview 

Section 2.1 covers material from matrix theory concerning the behavior of matrix functions.
Section 2.2 reviews relevant results from probability, especially the parts involving matrices.

2.1 Matrix Theory Background 

Let us begin with the results we require from the field of matrix analysis. 

2.1.1 Conventions 

We write R and C for the real and complex fields. A matrix is a finite, two-dimensional array of 
complex numbers. Many parts of the discussion do not depend on the size of a matrix, so we 
specify dimensions only when it really matters. Readers who wish to think about real-valued 
matrices will find that none of the results require any essential modification in this setting. 

2.1.2 Spaces of Vectors 

The symbol $\mathbb{C}^d$ denotes the complex linear space consisting of $d$-dimensional column vectors
with complex entries, equipped with the usual componentwise addition and multiplication by a
complex scalar. We endow this space with the standard $\ell_2$ inner product

$$ \langle x, y \rangle = x^* y = \sum_{i=1}^{d} x_i^* y_i \quad\text{for all } x, y \in \mathbb{C}^d. $$

The symbol ${}^*$ denotes the complex conjugate of a number, as well as the conjugate transpose of
a vector or matrix. The inner product induces the $\ell_2$ norm:

$$ \|x\|^2 = \langle x, x \rangle = \sum_{i=1}^{d} |x_i|^2 \quad\text{for all } x \in \mathbb{C}^d. \tag{2.1.1} $$

Similarly, the real linear space $\mathbb{R}^d$ consists of $d$-dimensional column vectors with real entries,
equipped with the usual componentwise addition and multiplication by a real scalar. The inner
product and $\ell_2$ norm on $\mathbb{R}^d$ are defined by the same relations as for $\mathbb{C}^d$.

2.1.3 Spaces of Matrices 

We write $\mathbb{M}^{d_1 \times d_2}$ for the complex linear space consisting of $d_1 \times d_2$ matrices with complex entries,
equipped with the usual componentwise addition and multiplication by a complex scalar. It is
convenient to identify $\mathbb{C}^d$ with the space $\mathbb{M}^{d \times 1}$. We write $\mathbb{M}_d$ for the algebra of $d \times d$ square,
complex matrices. The term “algebra” just means that we can multiply two matrices in $\mathbb{M}_d$ to
obtain another matrix in $\mathbb{M}_d$.

2.1.4 Topology & Convergence 

We can endow the space of matrices with the Frobenius norm:

$$ \|B\|_{\mathrm{F}}^2 = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} |b_{jk}|^2 \quad\text{for } B \in \mathbb{M}^{d_1 \times d_2}. \tag{2.1.2} $$

Observe that the Frobenius norm on $\mathbb{M}^{d \times 1}$ coincides with the $\ell_2$ norm (2.1.1) on $\mathbb{C}^d$.

The Frobenius norm induces a norm topology on the space of matrices. In particular, given
a sequence $\{B_n : n = 1, 2, 3, \dots\} \subset \mathbb{M}^{d_1 \times d_2}$, the symbol

$$ B_n \to B \quad\text{means that}\quad \|B_n - B\|_{\mathrm{F}} \to 0 \text{ as } n \to \infty. $$

Open and closed sets are also defined with respect to the Frobenius-norm topology. Every other
norm topology on $\mathbb{M}^{d_1 \times d_2}$ induces the same notions of convergence and open sets. We use the
same topology for the normed linear spaces $\mathbb{C}^d$ and $\mathbb{M}_d$.

2.1.5 Basic Vectors and Matrices 

We write $0$ for the zero vector or the zero matrix, while $\mathbf{I}$ denotes the identity matrix. Occasionally,
we add a subscript to specify the dimension. For instance, $\mathbf{I}_d$ is the $d \times d$ identity.

The standard basis for the linear space $\mathbb{C}^d$ consists of standard basis vectors. The standard
basis vector $\mathbf{e}_k$ is a column vector with a one in position $k$ and zeros elsewhere. We also write $\mathbf{e}$
for the column vector whose entries all equal one. There is a related notation for the standard
basis of $\mathbb{M}^{d_1 \times d_2}$. We write $\mathbf{E}_{jk}$ for the standard basis matrix with a one in position $(j, k)$ and zeros
elsewhere. The dimension of a standard basis vector and a standard basis matrix is typically
determined by the context.

A square matrix $Q$ that satisfies $QQ^* = \mathbf{I} = Q^*Q$ is called a unitary matrix. We reserve the
letter $Q$ for a unitary matrix. Readers who prefer the real setting may prefer to regard $Q$ as an
orthogonal matrix.

2.1.6 Hermitian Matrices and Eigenvalues 

An Hermitian matrix $A$ is a square matrix that satisfies $A = A^*$. A useful intuition from
operator theory is that Hermitian matrices are analogous with real numbers, while general square
matrices are analogous with complex numbers.

We write $\mathbb{H}_d$ for the collection of $d \times d$ Hermitian matrices. The set $\mathbb{H}_d$ is a linear space over
the real field. That is, we can add Hermitian matrices and multiply them by real numbers. The
space $\mathbb{H}_d$ inherits the Frobenius-norm topology from $\mathbb{M}_d$. We adopt Parlett’s convention [Par98]
that bold Latin and Greek letters that are symmetric around the vertical axis ($A, H, \dots, Y$; $\Delta, \Theta,
\dots, \Omega$) always represent Hermitian matrices.

Each Hermitian matrix $A \in \mathbb{H}_d$ has an eigenvalue decomposition

$$ A = Q \Lambda Q^* \quad\text{where } Q \in \mathbb{M}_d \text{ is unitary and } \Lambda \in \mathbb{H}_d \text{ is diagonal.} \tag{2.1.3} $$

The diagonal entries of $\Lambda$ are real numbers, which are referred to as the eigenvalues of $A$. The
unitary matrix $Q$ in the eigenvalue decomposition is not determined completely, but the list of
eigenvalues is unique modulo permutations. The eigenvalues of an Hermitian matrix are often
referred to as its spectrum.
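
The decomposition (2.1.3) can be checked numerically. The sketch below uses Python with NumPy (an assumption of this illustration, not something the text prescribes) and the routine `numpy.linalg.eigh`, which is designed for Hermitian input. The random test matrix and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a random 4 x 4 Hermitian matrix A = (G + G*) / 2.
G = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = (G + G.conj().T) / 2

# eigh returns real eigenvalues in ascending order and a unitary matrix Q
# whose columns are corresponding eigenvectors.
eigvals, Q = np.linalg.eigh(A)
Lam = np.diag(eigvals)

# A = Q Lam Q* and Q* Q = I, up to rounding error.
reconstruction_error = np.linalg.norm(Q @ Lam @ Q.conj().T - A)
unitarity_error = np.linalg.norm(Q.conj().T @ Q - np.eye(4))
```

The eigenvalues come back as real floating-point numbers even though `A` has complex entries, in line with the statement that the diagonal entries of the middle factor are real.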

We denote the algebraic minimum and maximum eigenvalues of an Hermitian matrix $A$ by
$\lambda_{\min}(A)$ and $\lambda_{\max}(A)$. The extreme eigenvalue maps are positive homogeneous:

$$ \lambda_{\min}(\alpha A) = \alpha \lambda_{\min}(A) \quad\text{and}\quad \lambda_{\max}(\alpha A) = \alpha \lambda_{\max}(A) \quad\text{for } \alpha \geq 0. \tag{2.1.4} $$

There is an important relationship between minimum and maximum eigenvalues:

$$ \lambda_{\min}(-A) = -\lambda_{\max}(A). \tag{2.1.5} $$

The fact (2.1.5) warns us that we must be careful passing scalars through an eigenvalue map.
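
A small numerical check of (2.1.4) and (2.1.5) may help fix the point. This is a minimal sketch in Python with NumPy; the matrix `A`, the scalar `alpha`, and the helper names `lam_min` and `lam_max` are chosen only for illustration.

```python
import numpy as np

def lam_min(A):
    """Algebraic minimum eigenvalue of a Hermitian matrix."""
    return np.linalg.eigvalsh(A)[0]

def lam_max(A):
    """Algebraic maximum eigenvalue of a Hermitian matrix."""
    return np.linalg.eigvalsh(A)[-1]

A = np.array([[2.0, 1.0], [1.0, -1.0]])   # symmetric, one positive and one negative eigenvalue
alpha = 3.0                                 # a nonnegative scalar

# Positive homogeneity (2.1.4).
homogeneous_max = np.isclose(lam_max(alpha * A), alpha * lam_max(A))
homogeneous_min = np.isclose(lam_min(alpha * A), alpha * lam_min(A))

# Sign flip (2.1.5): a negative scalar exchanges the roles of min and max.
sign_flip = np.isclose(lam_min(-A), -lam_max(A))

# The naive rule lam_max(-A) = -lam_max(A) is false for this matrix.
naive_rule_fails = not np.isclose(lam_max(-A), -lam_max(A))
```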

This work rarely requires any eigenvalues of an Hermitian matrix aside from the minimum
and maximum. When they do arise, we usually order the other eigenvalues in the weakly
decreasing sense:

$$ \lambda_1(A) \geq \lambda_2(A) \geq \dots \geq \lambda_d(A) \quad\text{for } A \in \mathbb{H}_d. $$

On occasion, it is more natural to arrange eigenvalues in the weakly increasing sense:

$$ \lambda_1^{\uparrow}(A) \leq \lambda_2^{\uparrow}(A) \leq \dots \leq \lambda_d^{\uparrow}(A) \quad\text{for } A \in \mathbb{H}_d. $$

To prevent confusion, we will accompany this notation with a reminder.

Readers who prefer the real setting may read “symmetric” in place of “Hermitian.” In this
case, the eigenvalue decomposition involves an orthogonal matrix $Q$. Note, however, that the
term “symmetric” has a different meaning when applied to random variables!

2.1.7 The Trace of a Square Matrix

The trace of a square matrix, denoted by tr, is the sum of its diagonal entries.

$$ \operatorname{tr} B = \sum_{j=1}^{d} b_{jj} \quad\text{for } B \in \mathbb{M}_d. \tag{2.1.6} $$

The trace is unitarily invariant:

$$ \operatorname{tr} B = \operatorname{tr}(Q B Q^*) \quad\text{for each } B \in \mathbb{M}_d \text{ and each unitary } Q \in \mathbb{M}_d. \tag{2.1.7} $$

In particular, the existence of an eigenvalue decomposition (2.1.3) shows that the trace of an
Hermitian matrix equals the sum of its eigenvalues.¹

Another valuable relation connects the trace with the Frobenius norm:

$$ \|C\|_{\mathrm{F}}^2 = \operatorname{tr}(CC^*) = \operatorname{tr}(C^*C) \quad\text{for all } C \in \mathbb{M}^{d_1 \times d_2}. \tag{2.1.8} $$

This expression follows from the definitions (2.1.2) and (2.1.6) and a short calculation.
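
The identities (2.1.7) and (2.1.8) are easy to confirm on random inputs. The following sketch, in Python with NumPy, is only an illustration; the test matrices are hypothetical, and the unitary matrix is obtained from a QR factorization, which is one convenient way (among others) to produce one.

```python
import numpy as np

rng = np.random.default_rng(2)

B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))

# The Q factor of a QR factorization of a square complex matrix is unitary.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3)))

# Unitary invariance of the trace (2.1.7).
trace_gap = abs(np.trace(B) - np.trace(Q @ B @ Q.conj().T))

# Trace formula for the squared Frobenius norm (2.1.8), rectangular case.
C = rng.standard_normal((2, 5))
fro_sq = np.linalg.norm(C, 'fro') ** 2
trace_CCt = np.trace(C @ C.T)   # trace of a 2 x 2 matrix
trace_CtC = np.trace(C.T @ C)   # trace of a 5 x 5 matrix
```

Note that the two products in (2.1.8) have different dimensions, yet their traces agree.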

2.1.8 The Semidefinite Partial Order 

A matrix $A \in \mathbb{H}_d$ is positive semidefinite when it satisfies

$$ u^* A u \geq 0 \quad\text{for each vector } u \in \mathbb{C}^d. \tag{2.1.9} $$

Equivalently, a matrix $A$ is positive semidefinite when it is Hermitian and its eigenvalues are all
nonnegative. Similarly, we say that $A \in \mathbb{H}_d$ is positive definite when

$$ u^* A u > 0 \quad\text{for each nonzero vector } u \in \mathbb{C}^d. \tag{2.1.10} $$

Equivalently, $A$ is positive definite when it is Hermitian and its eigenvalues are all positive.

Positive-semidefinite and positive-definite matrices play a special role in matrix theory,
analogous with the role of nonnegative and positive numbers in real analysis. In particular, observe
that the square of an Hermitian matrix is always positive semidefinite. The square of a
nonsingular Hermitian matrix is always positive definite.

The family of positive-semidefinite matrices in $\mathbb{H}_d$ forms a closed convex cone.² This
geometric fact follows easily from the definition (2.1.9). Indeed, for each vector $u \in \mathbb{C}^d$, the condition

$$ \{A \in \mathbb{H}_d : u^* A u \geq 0\} $$

describes a closed halfspace in $\mathbb{H}_d$. As a consequence, the family of positive-semidefinite
matrices in $\mathbb{H}_d$ is an intersection of closed halfspaces. Therefore, it is a closed convex set. To see why
this convex set is a cone, just note that

$$ A \text{ positive semidefinite} \quad\text{implies}\quad \alpha A \text{ is positive semidefinite for } \alpha \geq 0. $$

¹ This fact also holds true for a general square matrix.

² A convex cone is a subset $C$ of a linear space that is closed under conic combinations. That is, $\tau_1 x_1 + \tau_2 x_2 \in C$ for all
$x_1, x_2 \in C$ and all $\tau_1, \tau_2 > 0$. Equivalently, $C$ is a set that is both convex and positively homogeneous.

Beginning from (2.1.10), similar considerations show that the family of positive-definite matrices
in $\mathbb{H}_d$ forms an (open) convex cone.

We may now define the semidefinite partial order $\preccurlyeq$ on the real-linear space $\mathbb{H}_d$ using the rule

$$ A \preccurlyeq H \quad\text{if and only if}\quad H - A \text{ is positive semidefinite.} \tag{2.1.11} $$

In particular, we write $A \succcurlyeq 0$ to indicate that $A$ is positive semidefinite and $A \succ 0$ to indicate that
$A$ is positive definite. For a diagonal matrix $\Lambda$, the expression $\Lambda \succcurlyeq 0$ means that each entry of $\Lambda$
is nonnegative.

The semidefinite order is preserved by conjugation, a simple fact whose importance cannot 
be overstated. 

Proposition 2.1.1 (Conjugation Rule). Let $A$ and $H$ be Hermitian matrices of the same dimension,
and let $B$ be a general matrix with compatible dimensions. Then

$$ A \preccurlyeq H \quad\text{implies}\quad B A B^* \preccurlyeq B H B^*. \tag{2.1.12} $$

Finally, we remark that the trace of a positive-semidefinite matrix is at least as large as its
maximum eigenvalue:

$$ \lambda_{\max}(A) \leq \operatorname{tr} A \quad\text{when } A \text{ is positive semidefinite.} \tag{2.1.13} $$

This property follows from the definition of a positive-semidefinite matrix and the fact that the
trace of $A$ equals the sum of the eigenvalues.
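
The Conjugation Rule (2.1.12) and the trace bound (2.1.13) can be illustrated numerically. The sketch below is written in Python with NumPy and tests the semidefinite order by inspecting the smallest eigenvalue of a difference. The helper `is_psd`, its tolerance, and the random matrices are assumptions of the example.

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Return True when the Hermitian matrix M is positive semidefinite,
    up to a small numerical tolerance on its minimum eigenvalue."""
    return np.linalg.eigvalsh(M)[0] >= -tol

rng = np.random.default_rng(3)

G = rng.standard_normal((4, 4))
A = (G + G.T) / 2                 # a symmetric matrix
P = rng.standard_normal((4, 4))
S = P @ P.T                       # positive semidefinite by construction
H = A + S                         # hence A precedes H in the semidefinite order

B = rng.standard_normal((2, 4))   # general matrix with compatible dimensions

order_holds = is_psd(H - A)
conjugated_order_holds = is_psd(B @ H @ B.T - B @ A @ B.T)

# Trace bound (2.1.13) for the positive-semidefinite matrix S.
lam_max_S = np.linalg.eigvalsh(S)[-1]
trace_S = np.trace(S)
```

The conjugating matrix `B` is rectangular here, which shows that the rule may change the dimension of the matrices being compared.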

2.1.9 Standard Matrix Functions 

Let us describe the most direct method for extending a function on the real numbers to a
function on Hermitian matrices. The basic idea is to apply the function to each eigenvalue of the
matrix to construct a new matrix.

Definition 2.1.2 (Standard Matrix Function). Let $f : I \to \mathbb{R}$ where $I$ is an interval of the real line.
Consider a matrix $A \in \mathbb{H}_d$ whose eigenvalues are contained in $I$. Define the matrix $f(A) \in \mathbb{H}_d$ using
an eigenvalue decomposition of $A$:

$$ f(A) = Q \begin{bmatrix} f(\lambda_1) & & \\ & \ddots & \\ & & f(\lambda_d) \end{bmatrix} Q^* \quad\text{where}\quad A = Q \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{bmatrix} Q^*. $$

In particular, we can apply $f$ to a real diagonal matrix by applying the function to each diagonal
entry.

It can be verified that the definition of $f(A)$ does not depend on which eigenvalue decomposition
$A = Q \Lambda Q^*$ that we choose. Any matrix function that arises in this fashion is called a standard
matrix function.

To confirm that this definition is sensible, consider the power function $f(t) = t^q$ for a natural
number $q$. When $A$ is Hermitian, the power function $f(A) = A^q$, where $A^q$ is the $q$-fold product
of $A$.

For an Hermitian matrix $A$, whenever we write the power function $A^q$ or the exponential $\mathrm{e}^A$
or the logarithm $\log A$, we are always referring to a standard matrix function. Note that we only
define the matrix logarithm for positive-definite matrices, and non-integer powers are only valid
for positive-semidefinite matrices.
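
Definition 2.1.2 translates almost line by line into code. The following is a minimal sketch in Python with NumPy, not a production implementation; the helper name `standard_matrix_function` and the test matrix are our own choices for illustration.

```python
import numpy as np

def standard_matrix_function(f, A):
    """Apply the scalar function f to the Hermitian matrix A by applying it
    to each eigenvalue in an eigenvalue decomposition A = Q diag(lam) Q*."""
    eigvals, Q = np.linalg.eigh(A)
    # Q * f(eigvals) scales column j of Q by f(lam_j).
    return (Q * f(eigvals)) @ Q.conj().T

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # symmetric with eigenvalues 1 and 3

# The power function t -> t^3 agrees with the 3-fold matrix product.
cube = standard_matrix_function(lambda t: t ** 3, A)

# Exponential and logarithm as standard matrix functions.
# The logarithm is legitimate here because A is positive definite.
expA = standard_matrix_function(np.exp, A)
logA = standard_matrix_function(np.log, A)
```

The function `f` must accept an array of eigenvalues, and those eigenvalues must lie in the domain of `f`; for instance, `np.log` is appropriate only when the input matrix is positive definite.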

The following result is an immediate, but important, consequence of the definition of a
standard matrix function.

Proposition 2.1.3 (Spectral Mapping Theorem). Let $f : I \to \mathbb{R}$ be a function on an interval $I$ of
the real line, and let $A$ be an Hermitian matrix whose eigenvalues are contained in $I$. If $\lambda$ is an
eigenvalue of $A$, then $f(\lambda)$ is an eigenvalue of $f(A)$.

When a real function has a power series expansion, we can also represent the standard matrix
function with the same power series expansion. Indeed, suppose that $f : I \to \mathbb{R}$ is defined on an
interval $I$ of the real line, and assume that the eigenvalues of $A$ are contained in $I$. Then

$$ f(a) = c_0 + \sum_{q=1}^{\infty} c_q a^q \text{ for } a \in I \quad\text{implies}\quad f(A) = c_0 \mathbf{I} + \sum_{q=1}^{\infty} c_q A^q. $$

This formula can be verified using an eigenvalue decomposition of A and the definition of a 
standard matrix function. 

2.1.10 The Transfer Rule 

In most cases, the “obvious” generalization of an inequality for real-valued functions fails to hold 
in the semidefinite order. Nevertheless, there is one class of inequalities for real functions that 
extends to give semidefinite relationships for standard matrix functions. 

Proposition 2.1.4 (Transfer Rule). Let $f$ and $g$ be real-valued functions defined on an interval $I$
of the real line, and let $A$ be an Hermitian matrix whose eigenvalues are contained in $I$. Then

$$ f(a) \leq g(a) \text{ for each } a \in I \quad\text{implies}\quad f(A) \preccurlyeq g(A). \tag{2.1.14} $$

Proof. Decompose $A = Q \Lambda Q^*$. It is immediate that $f(\Lambda) \preccurlyeq g(\Lambda)$. The Conjugation Rule (2.1.12)
allows us to conjugate this relation by $Q$. Finally, we invoke Definition 2.1.2, of a standard matrix
function, to complete the argument. □
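
As a concrete instance of the Transfer Rule, the scalar inequality $1 + a \leq \mathrm{e}^a$, valid for every real $a$, transfers to the semidefinite relation $\mathbf{I} + A \preccurlyeq \mathrm{e}^A$ for every Hermitian $A$. The sketch below checks this numerically in Python with NumPy; the choice of inequality, the random matrix, and the helper name are assumptions made for the illustration.

```python
import numpy as np

def standard_matrix_function(f, A):
    """Apply f to the eigenvalues of the Hermitian matrix A (Definition 2.1.2)."""
    eigvals, Q = np.linalg.eigh(A)
    return (Q * f(eigvals)) @ Q.conj().T

rng = np.random.default_rng(4)
G = rng.standard_normal((5, 5))
A = (G + G.T) / 2                               # random symmetric matrix

lhs = np.eye(5) + A                             # f(A) with f(a) = 1 + a
rhs = standard_matrix_function(np.exp, A)       # g(A) with g(a) = exp(a)

# f(A) precedes g(A) exactly when g(A) - f(A) has no negative eigenvalue.
gap_min_eig = np.linalg.eigvalsh(rhs - lhs)[0]
```

Up to rounding error, `gap_min_eig` is nonnegative, as the proposition predicts.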

2.1.11 The Matrix Exponential 

For any Hermitian matrix $A$, we can introduce the matrix exponential $\mathrm{e}^A$ using Definition 2.1.2.
Equivalently, we can use a power series expansion:

$$ \mathrm{e}^A = \exp(A) = \mathbf{I} + \sum_{q=1}^{\infty} \frac{A^q}{q!}. \tag{2.1.15} $$

The Spectral Mapping Theorem, Proposition 2.1.3, implies that the exponential of an Hermitian
matrix is always positive definite.
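
The two descriptions of the matrix exponential can be compared directly. The following sketch, in Python with NumPy, evaluates a truncated version of the series (2.1.15) and the eigenvalue-based definition on a small example. The truncation length, the helper names, and the test matrix are assumptions of the illustration.

```python
import numpy as np

def expm_hermitian(A):
    """Matrix exponential of a Hermitian matrix via Definition 2.1.2."""
    eigvals, Q = np.linalg.eigh(A)
    return (Q * np.exp(eigvals)) @ Q.conj().T

def expm_series(A, terms=30):
    """Truncated power series I + sum_{q=1}^{terms} A^q / q!."""
    d = A.shape[0]
    term = np.eye(d)
    total = np.eye(d)
    for q in range(1, terms + 1):
        term = term @ A / q      # term now equals A^q / q!
        total = total + term
    return total

A = np.array([[1.0, 1.0], [1.0, 1.0]])   # symmetric with eigenvalues 0 and 2

exp_eig = expm_hermitian(A)
exp_ser = expm_series(A)

# The exponential is positive definite even though A itself is singular.
min_eig = np.linalg.eigvalsh(exp_eig)[0]
```

For a matrix of modest norm, a few dozen terms of the series already agree with the eigenvalue-based definition to working precision.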

We often work with the trace of the matrix exponential:

$$ \operatorname{tr}\exp : A \longmapsto \operatorname{tr} \mathrm{e}^A. $$

This function has a monotonicity property that we use extensively. For Hermitian matrices $A$
and $H$ with the same dimension,

$$ A \preccurlyeq H \quad\text{implies}\quad \operatorname{tr} \mathrm{e}^A \leq \operatorname{tr} \mathrm{e}^H. \tag{2.1.16} $$

We establish this result in §8.3.2.

2.1.12 The Matrix Logarithm

We can define the matrix logarithm as a standard matrix function. The matrix logarithm is also
the functional inverse of the matrix exponential:

$$ \log(\mathrm{e}^A) = A \quad\text{for each Hermitian matrix } A. \tag{2.1.17} $$

A valuable fact about the matrix logarithm is that it preserves the semidefinite order. For
positive-definite matrices $A$ and $H$ with the same dimension,

$$ A \preccurlyeq H \quad\text{implies}\quad \log A \preccurlyeq \log H. \tag{2.1.18} $$

We establish this result in §8.4.4. Let us stress that the matrix exponential does not have any
operator monotonicity property analogous with (2.1.18)!
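
The contrast between (2.1.16), (2.1.18), and the failure of operator monotonicity for the exponential can be seen on a pair of $2 \times 2$ matrices. The sketch below uses Python with NumPy; the specific matrices are our own choice of example, and the helper name is hypothetical.

```python
import numpy as np

def standard_matrix_function(f, A):
    """Apply f to the eigenvalues of the Hermitian matrix A (Definition 2.1.2)."""
    eigvals, Q = np.linalg.eigh(A)
    return (Q * f(eigvals)) @ Q.conj().T

A = np.array([[1.0, 1.0], [1.0, 1.0]])
H = np.array([[2.0, 1.0], [1.0, 1.0]])   # H - A = diag(1, 0) is positive semidefinite

expA = standard_matrix_function(np.exp, A)
expH = standard_matrix_function(np.exp, H)

# Trace exponential is monotone (2.1.16).
trace_monotone = np.trace(expA) <= np.trace(expH)

# The exponential itself is not operator monotone: exp(H) - exp(A) has a
# negative eigenvalue for this pair.
exp_gap_min_eig = np.linalg.eigvalsh(expH - expA)[0]

# The logarithm is operator monotone (2.1.18). Shift by the identity so that
# both matrices are positive definite; the order relation is unchanged.
A_pd = A + np.eye(2)
H_pd = H + np.eye(2)
log_gap = standard_matrix_function(np.log, H_pd) - standard_matrix_function(np.log, A_pd)
log_gap_min_eig = np.linalg.eigvalsh(log_gap)[0]
```

Here `exp_gap_min_eig` is strictly negative while `log_gap_min_eig` is nonnegative, so the same ordered pair separates the two behaviors.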

2.1.13 Singular Values of Rectangular Matrices 

A general matrix does not have an eigenvalue decomposition, but it admits a different
representation that is just as useful. Every $d_1 \times d_2$ matrix $B$ has a singular value decomposition

$$ B = Q_1 \Sigma Q_2^* \quad\text{where } Q_1 \text{ and } Q_2 \text{ are unitary and } \Sigma \text{ is nonnegative diagonal.} \tag{2.1.19} $$

The unitary matrices $Q_1$ and $Q_2$ have dimensions $d_1 \times d_1$ and $d_2 \times d_2$, respectively. The inner
matrix $\Sigma$ has dimension $d_1 \times d_2$, and we use the term diagonal in the sense that only the diagonal
entries $(\Sigma)_{jj}$ may be nonzero.

The diagonal entries of $\Sigma$ are called the singular values of $B$, and they are denoted as $\sigma_j(B)$.
The singular values are determined completely modulo permutations, and it is conventional to
arrange them in weakly decreasing order:

$$ \sigma_1(B) \geq \sigma_2(B) \geq \dots \geq \sigma_{\min\{d_1, d_2\}}(B). $$

There is an important relationship between singular values and eigenvalues. A general matrix
has two squares associated with it, $BB^*$ and $B^*B$, both of which are positive semidefinite. We
can use a singular value decomposition of $B$ to construct eigenvalue decompositions of the two
squares:

$$ BB^* = Q_1 (\Sigma \Sigma^*) Q_1^* \quad\text{and}\quad B^*B = Q_2 (\Sigma^* \Sigma) Q_2^*. \tag{2.1.20} $$

The two squares of $\Sigma$ are square, diagonal matrices with nonnegative entries. Conversely, we can
always extract a singular value decomposition from the eigenvalue decompositions of the two
squares.

We can write the Frobenius norm of a matrix in terms of the singular values:

$$ \|B\|_{\mathrm{F}}^2 = \sum_{j=1}^{\min\{d_1, d_2\}} \sigma_j(B)^2 \quad\text{for } B \in \mathbb{M}^{d_1 \times d_2}. \tag{2.1.21} $$

This expression follows from the expression (2.1.8) for the Frobenius norm, the property (2.1.20)
of the singular value decomposition, and the unitary invariance (2.1.7) of the trace.

2.1.14 The Spectral Norm

The spectral norm of an Hermitian matrix $A$ is defined by the relation

$$ \|A\| = \max\{\lambda_{\max}(A), -\lambda_{\min}(A)\}. \tag{2.1.22} $$

For a general matrix $B$, the spectral norm is defined to be the largest singular value:

$$ \|B\| = \sigma_1(B). \tag{2.1.23} $$

These two definitions are consistent for Hermitian matrices because of (2.1.20). When applied
to a row vector or a column vector, the spectral norm coincides with the $\ell_2$ norm (2.1.1).

We will often need the fact that

$$ \|B\|^2 = \|BB^*\| = \|B^*B\|. \tag{2.1.24} $$

This identity also follows from (2.1.20).
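
The relations among the singular value decomposition, the Frobenius norm, and the spectral norm can be verified on a random rectangular matrix. The sketch below uses Python with NumPy; note that `numpy.linalg.svd` returns the conjugate transpose of the second unitary factor, which we call `Q2_star`. The dimensions and variable names are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2 = 3, 5
B = rng.standard_normal((d1, d2)) + 1j * rng.standard_normal((d1, d2))

# Full SVD: Q1 is d1 x d1, Q2_star is d2 x d2, sigma holds the singular
# values in weakly decreasing order.
Q1, sigma, Q2_star = np.linalg.svd(B, full_matrices=True)
Sigma = np.zeros((d1, d2))
Sigma[:d1, :d1] = np.diag(sigma)   # valid because d1 <= d2 in this example

svd_error = np.linalg.norm(Q1 @ Sigma @ Q2_star - B)            # (2.1.19)

# Eigenvalues of B B* are the squared singular values (2.1.20).
eig_BBstar = np.linalg.eigvalsh(B @ B.conj().T)[::-1]            # descending

frobenius_sq = np.linalg.norm(B, 'fro') ** 2                     # (2.1.21)
spectral = np.linalg.norm(B, 2)                                  # (2.1.23)
spectral_of_square = np.linalg.norm(B @ B.conj().T, 2)           # (2.1.24)
```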

2.1.15 The Stable Rank 

In several of the applications, we need an analytic measure of the collinearity of the rows and
columns of a matrix called the stable rank. For a general matrix $B$, the stable rank is defined as

$$ \operatorname{srank}(B) = \frac{\|B\|_{\mathrm{F}}^2}{\|B\|^2}. \tag{2.1.25} $$

The stable rank is a lower bound for the algebraic rank:

$$ 1 \leq \operatorname{srank}(B) \leq \operatorname{rank}(B). $$

This point follows when we use (2.1.21) and (2.1.23) to express the two norms in terms of the
singular values of $B$. In contrast to the algebraic rank, the stable rank is a continuous function of
the matrix, so it is more suitable for numerical applications.
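
The contrast between the stable rank and the algebraic rank shows up clearly under a small perturbation. The following sketch, in Python with NumPy, is an illustration under our own choice of matrices; the helper name `stable_rank` and the perturbation size are hypothetical, and the function assumes a nonzero input.

```python
import numpy as np

def stable_rank(B):
    """Stable rank (2.1.25): squared Frobenius norm over squared spectral norm.
    Assumes B is nonzero."""
    return np.linalg.norm(B, 'fro') ** 2 / np.linalg.norm(B, 2) ** 2

identity = np.eye(4)                              # stable rank equals rank here

rank_one = np.outer(np.ones(3), np.ones(4))       # rank 1, stable rank 1
perturbed = rank_one + 1e-6 * np.eye(3, 4)        # tiny perturbation

rank_perturbed = np.linalg.matrix_rank(perturbed)   # algebraic rank jumps
srank_perturbed = stable_rank(perturbed)            # stable rank barely moves
```

The perturbation raises the algebraic rank of the rank-one matrix to full row rank, while the stable rank stays within a tiny distance of one.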

2.1.16 Dilations 

An extraordinarily fruitful idea from operator theory is to embed matrices within larger block
matrices, called dilations. Dilations have an almost magical power. In this work, we will use
dilations to extend matrix concentration inequalities from Hermitian matrices to general matrices.

Definition 2.1.5 (Hermitian Dilation). The Hermitian dilation

$$ \mathscr{H} : \mathbb{M}^{d_1 \times d_2} \longrightarrow \mathbb{H}_{d_1 + d_2} $$

is the map from a general matrix to an Hermitian matrix defined by

$$ \mathscr{H}(B) = \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix}. \tag{2.1.26} $$

It is clear that the Hermitian dilation is a real-linear map. Furthermore, the dilation retains
important spectral information. To see why, note that the square of the dilation satisfies

$$ \mathscr{H}(B)^2 = \begin{bmatrix} BB^* & 0 \\ 0 & B^*B \end{bmatrix}. \tag{2.1.27} $$

We discover that the squared eigenvalues of $\mathscr{H}(B)$ coincide with the squared singular values of
$B$, along with an appropriate number of zeros. As a consequence, $\|\mathscr{H}(B)\| = \|B\|$. Moreover,

$$ \lambda_{\max}(\mathscr{H}(B)) = \|\mathscr{H}(B)\| = \|B\|. \tag{2.1.28} $$

We will invoke the identity (2.1.28) repeatedly.

One way to justify the first relation in (2.1.28) is to introduce the first columns $u_1$ and $u_2$ of
the unitary matrices $Q_1$ and $Q_2$ that appear in the singular value decomposition $B = Q_1 \Sigma Q_2^*$.
Then we may calculate that

$$ \|B\| = \operatorname{Re}(u_1^* B u_2) = \frac{1}{2} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}^* \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \leq \lambda_{\max}(\mathscr{H}(B)) \leq \|\mathscr{H}(B)\| = \|B\|. $$

Indeed, the spectral norm of $B$ equals its largest singular value $\sigma_1(B)$, which coincides with
$u_1^* B u_2$ by construction of $u_1$ and $u_2$. The second identity relies on a direct calculation. The first
inequality follows from the variational representation of the maximum eigenvalue as a Rayleigh
quotient; this fact can also be derived as a consequence of (2.1.3). The second inequality
depends on the definition (2.1.22) of the spectral norm of an Hermitian matrix.
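
The Hermitian dilation and the identity (2.1.28) are straightforward to check numerically. The sketch below is written in Python with NumPy; the helper name `hermitian_dilation` and the random test matrix are assumptions made for the illustration.

```python
import numpy as np

def hermitian_dilation(B):
    """Return the (d1 + d2) x (d1 + d2) block matrix [[0, B], [B*, 0]]."""
    d1, d2 = B.shape
    return np.block([
        [np.zeros((d1, d1)), B],
        [B.conj().T, np.zeros((d2, d2))],
    ])

rng = np.random.default_rng(6)
B = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
HB = hermitian_dilation(B)

eigs = np.linalg.eigvalsh(HB)          # ascending, real
spectral_norm_B = np.linalg.norm(B, 2)

# The square of the dilation is block diagonal (2.1.27).
square = HB @ HB
top_left = square[:2, :2]              # should equal B B*
bottom_right = square[2:, 2:]          # should equal B* B
off_diagonal = square[:2, 2:]          # should vanish
```

In this example the maximum eigenvalue of the dilation matches the spectral norm of `B`, and the minimum eigenvalue is its negative, so the spectrum of the dilation is symmetric about zero.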

2.1.17 Other Matrix Norms 

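There are a number of other matrix norms that arise sporadically in this work. The Schatten
1-norm of a matrix can be defined as the sum of its singular values:

$$ \|B\|_{S_1} = \sum_{j=1}^{\min\{d_1, d_2\}} \sigma_j(B) \quad\text{for } B \in \mathbb{M}^{d_1 \times d_2}. \tag{2.1.29} $$

The entrywise $\ell_1$ norm of a matrix is defined as

$$ \|B\|_{\ell_1} = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} |b_{jk}| \quad\text{for } B \in \mathbb{M}^{d_1 \times d_2}. \tag{2.1.30} $$

We always have the relation

$$ \|B\|_{\ell_1} \leq \sqrt{d_1 d_2}\, \|B\|_{\mathrm{F}} \quad\text{for } B \in \mathbb{M}^{d_1 \times d_2} \tag{2.1.31} $$

because of the Cauchy-Schwarz inequality.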
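
These two norms and the bound (2.1.31) can be computed in a few lines. The sketch below uses Python with NumPy; the helper names and the test matrices are our own, and the comparison with NumPy's built-in nuclear norm (`ord='nuc'`) is included only as a cross-check.

```python
import numpy as np

def schatten_1_norm(B):
    """Sum of the singular values of B (2.1.29)."""
    return np.linalg.svd(B, compute_uv=False).sum()

def entrywise_l1_norm(B):
    """Sum of the absolute values of the entries of B (2.1.30)."""
    return np.abs(B).sum()

rng = np.random.default_rng(7)
B = rng.standard_normal((3, 4))
d1, d2 = B.shape

# Right-hand side of (2.1.31).
bound = np.sqrt(d1 * d2) * np.linalg.norm(B, 'fro')

# The all-ones matrix attains equality in (2.1.31).
ones = np.ones((3, 4))
ones_bound = np.sqrt(d1 * d2) * np.linalg.norm(ones, 'fro')
```

Equality for the all-ones matrix reflects the equality case of the Cauchy-Schwarz inequality, which requires all entries to have the same magnitude.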


2.2 Probability with Matrices 

We continue with some material from probability, focusing on connections with matrices.

2.2.1 Conventions

We prefer to avoid abstraction and unnecessary technical detail, so we frame the standing
assumption that all random variables are sufficiently regular that we are justified in computing
expectations, interchanging limits, and so forth. The manipulations we perform are valid if we
assume that all random variables are bounded, but the results hold in broader circumstances if
we instate appropriate regularity conditions.

Since the expectation operator is linear, we typically do not use parentheses with it. We
instate the convention that powers and products take precedence over the expectation operator.
In particular,

$$ \mathbb{E}\, X^q = \mathbb{E}(X^q). $$

This position helps us reduce the clutter of parentheses. We sometimes include extra delimiters
when it is helpful for clarity.

2.2.2 Some Scalar Random Variables 

We use consistent notation for some of the basic scalar random variables. 

Standard normal variables. We reserve the letter $\gamma$ for a normal(0, 1) random variable. That is,
$\gamma$ is a real Gaussian with mean zero and variance one.

Rademacher random variables. We reserve the letter $\varrho$ for a random variable that takes the two
values ±1 with equal probability.

Bernoulli random variables. A Bernoulli($p$) random variable takes the value one with
probability $p$ and the value zero with probability $1 - p$, where $p \in [0, 1]$. We use the letters $\delta$ and
$\xi$ for Bernoulli random variables.

2.2.3 Random Matrices 

Let $(\Omega, \mathscr{F}, \mathbb{P})$ be a probability space. A random matrix $Z$ is a measurable map

$$ Z : \Omega \longrightarrow \mathbb{M}^{d_1 \times d_2}. $$

It is more natural to think of the entries of $Z$ as complex random variables that may or may not
be correlated with each other. We reserve the letters $X$ and $Y$ for random Hermitian matrices,
while the letter $Z$ denotes a general random matrix.

A finite sequence $\{Z_k\}$ of random matrices is independent when

$$ \mathbb{P}\{Z_k \in F_k \text{ for each } k\} = \prod\nolimits_k \mathbb{P}\{Z_k \in F_k\} $$

for every collection $\{F_k\}$ of Borel subsets of $\mathbb{M}^{d_1 \times d_2}$.

2.2.4 Expectation 

The expectation of a random matrix $Z = [Z_{jk}]$ is simply the matrix formed by taking the
componentwise expectation. That is,

\[
(\mathbb{E} Z)_{jk} = \mathbb{E} Z_{jk}.
\]

Under mild assumptions, expectation commutes with linear and real-linear maps. Indeed,
expectation commutes with multiplication by a fixed matrix:

\[
\mathbb{E}(BZ) = B\,(\mathbb{E} Z) \quad\text{and}\quad \mathbb{E}(ZB) = (\mathbb{E} Z)\,B.
\]

In particular, the product rule for the expectation of independent random variables extends to
matrices:

\[
\mathbb{E}(SZ) = (\mathbb{E} S)(\mathbb{E} Z) \quad\text{when $S$ and $Z$ are independent.}
\]

We use these identities liberally, without any further comment. 

2.2.5 Inequalities for Expectation 

Markov’s inequality states that a nonnegative (real) random variable X obeys the probability
bound

\[
\mathbb{P}\{ X \geq t \} \leq \frac{\mathbb{E} X}{t} \quad\text{for } t > 0.
\tag{2.2.1}
\]

The Markov inequality is a central tool for establishing concentration inequalities.

Jensen’s inequality describes how averaging interacts with convexity. Let Z be a random
matrix, and let h be a real-valued function on matrices. Then

\[
\begin{aligned}
\mathbb{E}\, h(Z) &\leq h(\mathbb{E} Z) \quad\text{when $h$ is concave, and} \\
\mathbb{E}\, h(Z) &\geq h(\mathbb{E} Z) \quad\text{when $h$ is convex.}
\end{aligned}
\tag{2.2.2}
\]

The family of positive-semidefinite matrices in $\mathbb{H}_d$ forms a convex cone, and the expectation
of a random matrix can be viewed as a convex combination. Therefore, expectation preserves
the semidefinite order:

\[
X \preccurlyeq Y \quad\text{implies}\quad \mathbb{E} X \preccurlyeq \mathbb{E} Y.
\]

We use this result many times without direct reference. 

2.2.6 The Variance of a Random Hermitian Matrix 

The variance of a real random variable Y is defined as the expected squared deviation from the
mean:

\[
\operatorname{Var}(Y) = \mathbb{E}(Y - \mathbb{E} Y)^2.
\]

There are a number of natural extensions of this concept in the matrix setting that play a role in 
our theory. 

Suppose that Y is a random Hermitian matrix. We can define a matrix-valued variance:

\[
\operatorname{Var}(Y) = \mathbb{E}(Y - \mathbb{E} Y)^2 = \mathbb{E} Y^2 - (\mathbb{E} Y)^2.
\tag{2.2.3}
\]

The matrix Var(Y) is always positive semidefinite. We can interpret the (j, k) entry of this matrix
as the covariance between the jth and kth columns of Y:

\[
(\operatorname{Var}(Y))_{jk} = \mathbb{E}\big[ (y_{:j} - \mathbb{E}\, y_{:j})^* (y_{:k} - \mathbb{E}\, y_{:k}) \big],
\]

where we have written $y_{:j}$ for the jth column of Y.

The matrix-valued variance contains a lot of information about the fluctuations of the random
matrix. We can summarize Var(Y) using a single number $v(Y)$, which we call the matrix
variance statistic:

\[
v(Y) = \|\operatorname{Var}(Y)\| = \|\mathbb{E}(Y - \mathbb{E} Y)^2\|.
\tag{2.2.4}
\]

To understand what this quantity means, one may wish to rewrite it as

\[
v(Y) = \sup_{\|u\| = 1} \mathbb{E}\, \|(Yu) - \mathbb{E}(Yu)\|^2.
\]

Roughly speaking, the matrix variance statistic describes the maximum variance of $Yu$ for any
unit vector u.
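
To make the definitions (2.2.3) and (2.2.4) concrete, here is a minimal sketch that evaluates the matrix-valued variance and the matrix variance statistic for a random matrix with finitely many outcomes. The example distribution (a fixed real symmetric 2 x 2 matrix A multiplied by a random sign) and the helper names are assumptions made for illustration. The spectral norm is computed from the closed-form eigenvalues of a real symmetric 2 x 2 matrix, so the helper does not apply to larger matrices.

```python
import math

def matmul(A, B):
    """Product of two small dense matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def spectral_norm_sym2(M):
    """Spectral norm of a real symmetric 2 x 2 matrix via its two eigenvalues."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    return max(abs(mid + rad), abs(mid - rad))

def matrix_variance(dist):
    """Var(Y) = E (Y - EY)^2 for a finite distribution [(prob, matrix), ...]."""
    n = len(dist[0][1])
    mean = [[sum(p * Y[i][j] for p, Y in dist) for j in range(n)] for i in range(n)]
    var = [[0.0] * n for _ in range(n)]
    for p, Y in dist:
        D = [[Y[i][j] - mean[i][j] for j in range(n)] for i in range(n)]
        D2 = matmul(D, D)
        for i in range(n):
            for j in range(n):
                var[i][j] += p * D2[i][j]
    return var

# Y = (random sign) * A, so Y takes the values A and -A with probability 1/2 each.
A = [[1.0, 2.0], [2.0, -1.0]]
neg_A = [[-x for x in row] for row in A]
dist = [(0.5, A), (0.5, neg_A)]

V = matrix_variance(dist)     # equals A^2 because E Y = 0
v = spectral_norm_sym2(V)     # matrix variance statistic v(Y)
print(V, v)
```

In this example $\mathbb{E} Y = 0$ and $\operatorname{Var}(Y) = A^2 = 5\,I$, so the statistic equals $\|A\|^2 = 5$.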

2.2.7 The Variance of a Sum of Independent, Random Hermitian Matrices 

The matrix-valued variance interacts beautifully with a sum of independent random matrices.
Consider a finite sequence $\{X_k\}$ of independent, random Hermitian matrices with common
dimension d. Introduce the sum $Y = \sum_k X_k$. Then

\[
\begin{aligned}
\operatorname{Var}(Y) = \operatorname{Var}\Big(\sum\nolimits_k X_k\Big)
&= \mathbb{E}\Big(\sum\nolimits_k (X_k - \mathbb{E} X_k)\Big)^2 \\
&= \sum\nolimits_{j,k} \mathbb{E}\big[(X_j - \mathbb{E} X_j)(X_k - \mathbb{E} X_k)\big] \\
&= \sum\nolimits_k \mathbb{E}(X_k - \mathbb{E} X_k)^2 \\
&= \sum\nolimits_k \operatorname{Var}(X_k).
\end{aligned}
\tag{2.2.5}
\]

The cross terms with $j \neq k$ vanish because the centered summands are independent and have
zero mean. This identity matches the familiar result for the variance of a sum of independent
scalar random variables. It follows that the matrix variance statistic satisfies

\[
v(Y) = \Big\| \sum\nolimits_k \operatorname{Var}(X_k) \Big\|.
\tag{2.2.6}
\]

The fact that the sum remains inside the norm is very important. Indeed, the best general
inequalities between $v(Y)$ and the matrix variance statistics $v(X_k)$ of the summands are

\[
v(Y) \leq \sum\nolimits_k v(X_k) \leq d \cdot v(Y).
\]

These relations can be improved in some special cases. For example, when the matrices $X_k$ are
identically distributed, the left-hand inequality becomes an identity.
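
A small worked example (an illustration, not taken from the text) shows that the factor d on the right-hand side can be attained. Let $\varrho_1, \dots, \varrho_d$ be independent random signs, and let $E_{kk}$ denote the $d \times d$ matrix with a one in position $(k, k)$ and zeros elsewhere; both symbols are introduced only for this example. Set $X_k = \varrho_k E_{kk}$ and $Y = \sum_k X_k$. Since each summand has zero mean,

\[
\operatorname{Var}(X_k) = \mathbb{E}\big[\varrho_k^2\, E_{kk}^2\big] = E_{kk},
\qquad
v(X_k) = \|E_{kk}\| = 1,
\qquad
\operatorname{Var}(Y) = \sum_{k=1}^d E_{kk} = I,
\qquad
v(Y) = 1.
\]

Consequently $\sum_k v(X_k) = d = d \cdot v(Y)$. Replacing the norm of the sum by the sum of the norms would overstate the variance statistic by a factor of the dimension in this case.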

2.2.8 The Variance of a Rectangular Random Matrix 

We will often work with non-Hermitian random matrices. In this case, we need to account for
the fact that a general matrix has two different squares. Suppose that Z is a random matrix with
dimension $d_1 \times d_2$. Define

\[
\begin{aligned}
\operatorname{Var}_1(Z) &= \mathbb{E}\big[(Z - \mathbb{E} Z)(Z - \mathbb{E} Z)^*\big], \quad\text{and} \\
\operatorname{Var}_2(Z) &= \mathbb{E}\big[(Z - \mathbb{E} Z)^*(Z - \mathbb{E} Z)\big].
\end{aligned}
\tag{2.2.7}
\]

The matrix $\operatorname{Var}_1(Z)$ is a positive-semidefinite matrix with dimension $d_1 \times d_1$, and it describes the
fluctuation of the rows of Z. The matrix $\operatorname{Var}_2(Z)$ is a positive-semidefinite matrix with dimension
$d_2 \times d_2$, and it reflects the fluctuation of the columns of Z. For an Hermitian random matrix Y,

\[
\operatorname{Var}(Y) = \operatorname{Var}_1(Y) = \operatorname{Var}_2(Y).
\]

In other words, the two variances coincide in the Hermitian setting.

As before, it is valuable to reduce these matrix-valued variances to a single scalar parameter.
We define the matrix variance statistic of a general random matrix Z as

\[
v(Z) = \max\big\{ \|\operatorname{Var}_1(Z)\|,\ \|\operatorname{Var}_2(Z)\| \big\}.
\tag{2.2.8}
\]

When Z is Hermitian, the definition (2.2.8) coincides with the original definition (2.2.4).

To promote a deeper appreciation for the formula (2.2.8), let us explain how it arises from the
Hermitian dilation (2.1.26). By direct calculation,

\[
\begin{aligned}
\operatorname{Var}(\mathscr{H}(Z))
&= \mathbb{E} \begin{bmatrix} 0 & (Z - \mathbb{E} Z) \\ (Z - \mathbb{E} Z)^* & 0 \end{bmatrix}^2 \\
&= \mathbb{E} \begin{bmatrix} (Z - \mathbb{E} Z)(Z - \mathbb{E} Z)^* & 0 \\ 0 & (Z - \mathbb{E} Z)^*(Z - \mathbb{E} Z) \end{bmatrix} \\
&= \begin{bmatrix} \operatorname{Var}_1(Z) & 0 \\ 0 & \operatorname{Var}_2(Z) \end{bmatrix}.
\end{aligned}
\tag{2.2.9}
\]

The first identity is the definition (2.2.3) of the matrix-valued variance. The second line follows
from the formula (2.1.27) for the square of the dilation. The last identity depends on the
definition (2.2.7) of the two matrix-valued variances. Therefore, using the definitions (2.2.4) and (2.2.8)
of the matrix variance statistics,

\[
v(\mathscr{H}(Z)) = \|\operatorname{Var}(\mathscr{H}(Z))\|
= \max\big\{ \|\operatorname{Var}_1(Z)\|,\ \|\operatorname{Var}_2(Z)\| \big\} = v(Z).
\tag{2.2.10}
\]

The second identity holds because the spectral norm of a block-diagonal matrix is the maximum
norm achieved by one of the diagonal blocks.


2.2.9 The Variance of a Sum of Independent Random Matrices 

As in the Hermitian case, the matrix-valued variances interact nicely with an independent sum.
Consider a finite sequence $\{S_k\}$ of independent random matrices with the same dimension.
Form the sum $Z = \sum_k S_k$. Repeating the calculation leading up to (2.2.6), we find that

\[
\operatorname{Var}_1(Z) = \sum\nolimits_k \operatorname{Var}_1(S_k)
\quad\text{and}\quad
\operatorname{Var}_2(Z) = \sum\nolimits_k \operatorname{Var}_2(S_k).
\]

In summary, the matrix variance statistic of an independent sum satisfies

\[
v(Z) = \max\Big\{ \Big\|\sum\nolimits_k \operatorname{Var}_1(S_k)\Big\|,\ \Big\|\sum\nolimits_k \operatorname{Var}_2(S_k)\Big\| \Big\}.
\tag{2.2.11}
\]

This formula arises time after time. 
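
The two variances in (2.2.11) can differ substantially. As an illustration that is not part of the original text, let $\mathbf{e}_1, \dots, \mathbf{e}_n$ be the standard basis vectors of $\mathbb{C}^n$, let $\varrho_1, \dots, \varrho_n$ be independent random signs, and consider the $1 \times n$ random matrices $S_k = \varrho_k \mathbf{e}_k^*$ (all of this notation is assumed for the example). The sum $Z = \sum_k S_k$ is a row vector of random signs with zero mean, and

\[
\operatorname{Var}_1(Z) = \sum_{k=1}^n \mathbb{E}\big[\varrho_k^2\big]\, \mathbf{e}_k^* \mathbf{e}_k = n,
\qquad
\operatorname{Var}_2(Z) = \sum_{k=1}^n \mathbb{E}\big[\varrho_k^2\big]\, \mathbf{e}_k \mathbf{e}_k^* = I_n,
\qquad
v(Z) = \max\{ n, 1 \} = n.
\]

Here $\|Z\|^2 = Z Z^* = \sum_k \varrho_k^2 = n$ for every outcome, so the larger of the two variances is the one that reflects the size of the spectral norm.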


2.3 Notes 

Everything in this chapter is firmly established. We have culled the results that are relevant to our
discussion. Let us give some additional references for readers who would like more information.

2.3.1 Matrix Analysis

Our treatment of matrix analysis is drawn from Bhatia’s excellent books on matrix analysis [Bha97,
Bha07]. The two books [HJ13, HJ94] of Horn & Johnson also serve as good general references.
Higham’s work [Hig08] is a generous source of information about matrix functions. Other valuable
resources include Carlen’s lecture notes [Car10], the book of Petz [Pet11], and the book of
Hiai & Petz [HP14].

2.3.2 Probability with Matrices 

The classic introduction to probability is the two-volume treatise [Fel68, Fel71] of Feller. The 
book [GS01] of Grimmett & Stirzaker offers a good treatment of probability theory and random 
processes at an intermediate level. For a more theoretical presentation, consider the book [Shi96] 
of Shiryaev. 

There are too many books on random matrix theory for us to include a comprehensive list;
here is a selection that the author finds useful. Tao’s book [Tao12] gives a friendly introduction
to some of the major aspects of classical and modern random matrix theory. The lecture
notes [Kem13] of Kemp are also extremely readable. The survey of Vershynin [Ver12] provides a
good summary of techniques from asymptotic convex geometry that are relevant to random matrix
theory. The works of Mardia, Kent, & Bibby [MKB79] and Muirhead [Mui82] present classical
results on random matrices that are particularly useful in statistics, while Bai & Silverstein [BS10]
contains a comprehensive modern treatment. Nica and Speicher [NS06] offer an entrée to the
beautiful field of free probability. Mehta’s treatise [Meh04] was the first book on random matrix
theory available, and it remains solid.


CHAPTER 3

The Matrix Laplace Transform Method


This chapter contains the core part of the analysis that ultimately delivers matrix concentration 
inequalities. Readers who are only interested in the concentration inequalities themselves or the 
example applications may wish to move on to Chapters 4, 5, and 6. 

In the scalar setting, the Laplace transform method provides a simple but powerful way to 
develop concentration inequalities for a sum of independent random variables. This technique 
is sometimes referred to as the “Bernstein trick” or “Chernoff bounding.” For a primer, we
recommend [BLM13, Chap. 2].

In the matrix setting, there is a very satisfactory extension of this argument that allows us to 
prove concentration inequalities for a sum of independent random matrices. As in the scalar 
case, the matrix Laplace transform method is both easy to use and incredibly useful. In contrast 
to the scalar case, the arguments that lead to matrix concentration are no longer elementary. The 
purpose of this chapter is to install the framework we need to support these results. Fortunately, 
in practical applications, all of the technical difficulty remains invisible. 


Overview 

We first define matrix analogs of the moment generating function and the cumulant generating
function, which pack up information about the fluctuations of a random Hermitian matrix.
Section 3.2 explains how we can use the matrix mgf to obtain probability inequalities for the
maximum eigenvalue of a random Hermitian matrix. The next task is to develop a bound for the
mgf of a sum of independent random matrices using information about the summands. In §3.3,
we discuss the challenges that arise; §3.4 presents the ideas we need to overcome these obstacles.
Section 3.5 establishes that the classical result on additivity of cumulants has a companion
in the matrix setting. This result allows us to develop a collection of abstract probability
inequalities in §3.6 that we can specialize to obtain matrix Chernoff bounds, matrix Bernstein bounds,
and so forth.

3.1 Matrix Moments and Cumulants

At the heart of the Laplace transform method are the moment generating function (mgf) and 
the cumulant generating function (cgf) of a random variable. We begin by presenting matrix 
versions of the mgf and cgf. 

Definition 3.1.1 (Matrix Mgf and Cgf). Let X be a random Hermitian matrix. The matrix moment
generating function $M_X$ and the matrix cumulant generating function $\Xi_X$ are given by

\[
M_X(\theta) = \mathbb{E}\, e^{\theta X}
\quad\text{and}\quad
\Xi_X(\theta) = \log \mathbb{E}\, e^{\theta X}
\quad\text{for } \theta \in \mathbb{R}.
\tag{3.1.1}
\]

Note that the expectations may not exist for all values of $\theta$.

The matrix mgf $M_X$ and matrix cgf $\Xi_X$ contain information about how much the random matrix
X varies. We aim to exploit the data encoded in these functions to control the eigenvalues.

Let us take a moment to expand on Definition 3.1.1; this discussion is not important for
subsequent developments. Observe that the matrix mgf and cgf have formal power series
expansions:

\[
M_X(\theta) = I + \sum_{q=1}^{\infty} \frac{\theta^q}{q!}\, (\mathbb{E} X^q)
\quad\text{and}\quad
\Xi_X(\theta) = \sum_{q=1}^{\infty} \frac{\theta^q}{q!}\, \Psi_q.
\]

We call the coefficients $\mathbb{E} X^q$ matrix moments, and we refer to $\Psi_q$ as a matrix cumulant. The
matrix cumulant $\Psi_q$ has a formal expression as a (noncommutative) polynomial in the matrix
moments up to order q. In particular, the first cumulant is the mean and the second cumulant
is the variance:

\[
\Psi_1 = \mathbb{E} X
\quad\text{and}\quad
\Psi_2 = \mathbb{E} X^2 - (\mathbb{E} X)^2 = \operatorname{Var}(X).
\]

The matrix variance was introduced in (2.2.3). Higher-order cumulants are harder to write down 
and interpret. 

3.2 The Matrix Laplace Transform Method 

In the scalar setting, the Laplace transform method allows us to obtain tail bounds for a random 
variable in terms of its mgf. The starting point for our theory is the observation that a similar 
result holds in the matrix setting. 

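Proposition 3.2.1 (Tail Bounds for Eigenvalues). Let Y be a random Hermitian matrix. For all
$t \in \mathbb{R}$,

\[
\mathbb{P}\{ \lambda_{\max}(Y) \geq t \} \leq \inf_{\theta > 0}\ e^{-\theta t}\, \mathbb{E} \operatorname{tr} e^{\theta Y}, \quad\text{and}
\tag{3.2.1}
\]

\[
\mathbb{P}\{ \lambda_{\min}(Y) \leq t \} \leq \inf_{\theta < 0}\ e^{-\theta t}\, \mathbb{E} \operatorname{tr} e^{\theta Y}.
\tag{3.2.2}
\]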

In words, we can control the tail probabilities of the extreme eigenvalues of a random matrix 
by producing a bound for the trace of the matrix mgf. The proof of this fact parallels the classical 
argument, but there is a twist. 

Proof. We begin with (3.2.1). Fix a positive number $\theta$, and observe that

\[
\mathbb{P}\{ \lambda_{\max}(Y) \geq t \}
= \mathbb{P}\big\{ e^{\theta \lambda_{\max}(Y)} \geq e^{\theta t} \big\}
\leq e^{-\theta t}\, \mathbb{E}\, e^{\theta \lambda_{\max}(Y)}
= e^{-\theta t}\, \mathbb{E}\, e^{\lambda_{\max}(\theta Y)}.
\]

The first identity holds because $a \mapsto e^{\theta a}$ is a monotone increasing function, so the event does not
change under the mapping. The second relation is Markov’s inequality (2.2.1). The last holds
because the maximum eigenvalue is a positive-homogeneous map, as stated in (2.1.4). To control
the exponential, note that

\[
e^{\lambda_{\max}(\theta Y)} = \lambda_{\max}\big(e^{\theta Y}\big) \leq \operatorname{tr} e^{\theta Y}.
\tag{3.2.3}
\]

The first identity depends on the Spectral Mapping Theorem, Proposition 2.1.3, and the fact that
the exponential function is increasing. The inequality follows because the exponential of an
Hermitian matrix is positive definite, and (2.1.13) shows that the maximum eigenvalue of a
positive-definite matrix is dominated by the trace. Combine the latter two displays to reach

\[
\mathbb{P}\{ \lambda_{\max}(Y) \geq t \} \leq e^{-\theta t}\, \mathbb{E} \operatorname{tr} e^{\theta Y}.
\]

This inequality is valid for any positive $\theta$, so we may take an infimum to achieve the tightest
possible bound.

To prove (3.2.2), we use a similar approach. Fix a negative number $\theta$, and calculate that

\[
\mathbb{P}\{ \lambda_{\min}(Y) \leq t \}
= \mathbb{P}\big\{ e^{\theta \lambda_{\min}(Y)} \geq e^{\theta t} \big\}
\leq e^{-\theta t}\, \mathbb{E}\, e^{\theta \lambda_{\min}(Y)}
= e^{-\theta t}\, \mathbb{E}\, e^{\lambda_{\max}(\theta Y)}.
\]

The function $a \mapsto e^{\theta a}$ reverses the inequality in the event because it is monotone decreasing. The
last identity depends on the relationship (2.1.5) between minimum and maximum eigenvalues.
Finally, we introduce the inequality (3.2.3) for the trace exponential and minimize over negative
values of $\theta$. □
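
The tail bound (3.2.1) can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions: the random matrix takes three hypothetical real symmetric 2 x 2 values with the listed probabilities, the eigenvalues come from the closed form for 2 x 2 symmetric matrices, and the infimum over $\theta$ is replaced by a minimum over a finite grid (every grid value is still a valid upper bound).

```python
import math

def eigs_sym2(M):
    """(min, max) eigenvalues of a real symmetric 2 x 2 matrix."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    return mid - rad, mid + rad

def trace_exp(M, theta):
    """tr exp(theta * M), computed from the eigenvalues of M."""
    lo, hi = eigs_sym2(M)
    return math.exp(theta * lo) + math.exp(theta * hi)

# Hypothetical finite distribution of a random symmetric matrix Y: (probability, value).
dist = [(0.5, [[0.0, 0.0], [0.0, 0.0]]),
        (0.3, [[1.0, 0.5], [0.5, 0.0]]),
        (0.2, [[2.0, -1.0], [-1.0, 1.0]])]

def tail_probability(t):
    """Exact P{lambda_max(Y) >= t}."""
    return sum(p for p, Y in dist if eigs_sym2(Y)[1] >= t)

def laplace_bound(t, thetas):
    """min over the grid of exp(-theta t) * E tr exp(theta Y)."""
    return min(math.exp(-th * t) * sum(p * trace_exp(Y, th) for p, Y in dist)
               for th in thetas)

thetas = [0.05 * k for k in range(1, 201)]   # grid of positive theta values
t = 2.5
exact = tail_probability(t)
bound = laplace_bound(t, thetas)
print(exact, bound)
```

For this distribution only the third outcome has a maximum eigenvalue above 2.5, so the exact probability is 0.2, and the grid minimum of the bound is a somewhat larger number below one.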

In the proof of Proposition 3.2.1, it may seem crude to bound the maximum eigenvalue by the 
trace. In fact, our overall approach leads to matrix concentration inequalities that are sharp for 
specific examples (see the discussion in §§4.1.2, 5.1.2, and 6.1.2), so we must conclude that the 
loss in this bound is sometimes inevitable. At the same time, this maneuver allows us to exploit 
some amazing convexity properties of the trace exponential. 

We can adapt the proof of Proposition 3.2.1 to obtain bounds for the expectation of the maximum
eigenvalue of a random Hermitian matrix. This argument does not have a perfect analog
in the scalar setting.

Proposition 3.2.2 (Expectation Bounds for Eigenvalues). Let Y be a random Hermitian matrix.
Then

\[
\mathbb{E}\, \lambda_{\max}(Y) \leq \inf_{\theta > 0}\ \frac{1}{\theta} \log \mathbb{E} \operatorname{tr} e^{\theta Y}, \quad\text{and}
\tag{3.2.4}
\]

\[
\mathbb{E}\, \lambda_{\min}(Y) \geq \sup_{\theta < 0}\ \frac{1}{\theta} \log \mathbb{E} \operatorname{tr} e^{\theta Y}.
\tag{3.2.5}
\]

Proof. We establish the bound (3.2.4); the proof of (3.2.5) is quite similar. Fix a positive number
$\theta$, and calculate that

\[
\mathbb{E}\, \lambda_{\max}(Y)
= \frac{1}{\theta}\, \mathbb{E} \log e^{\lambda_{\max}(\theta Y)}
\leq \frac{1}{\theta} \log \mathbb{E}\, e^{\lambda_{\max}(\theta Y)}
= \frac{1}{\theta} \log \mathbb{E}\, \lambda_{\max}\big(e^{\theta Y}\big)
\leq \frac{1}{\theta} \log \mathbb{E} \operatorname{tr} e^{\theta Y}.
\]

The first identity holds because the maximum eigenvalue is a positive-homogeneous map, as
stated in (2.1.4). The second relation is Jensen’s inequality. The third follows when we use the
Spectral Mapping Theorem, Proposition 2.1.3, to draw the eigenvalue map through the exponential.
The final inequality depends on the fact (2.1.13) that the trace of a positive-definite
matrix dominates the maximum eigenvalue. □

3.3 The Failure of the Matrix Mgf

We would like to use the Laplace transform bounds from Section 3.2 to study a sum of independent
random matrices. In the scalar setting, the Laplace transform method is effective for
studying an independent sum because the mgf and the cgf decompose. In the matrix case, the
situation is more subtle, and the goal of this section is to indicate where things go awry.

Consider an independent sequence $\{X_k\}$ of real random variables. The mgf of the sum satisfies
a multiplication rule:

\[
M_{(\sum_k X_k)}(\theta)
= \mathbb{E} \exp\Big(\sum\nolimits_k \theta X_k\Big)
= \mathbb{E} \prod\nolimits_k e^{\theta X_k}
= \prod\nolimits_k \mathbb{E}\, e^{\theta X_k}
= \prod\nolimits_k M_{X_k}(\theta).
\tag{3.3.1}
\]

The first identity is the definition of an mgf. The second relation holds because the exponential
map converts a sum of real scalars to a product, and the third relation requires the independence
of the random variables. The last identity, again, is the definition.

At first, we might imagine that a similar relationship holds for the matrix mgf. Consider an
independent sequence $\{X_k\}$ of random Hermitian matrices. Perhaps,

\[
M_{(\sum_k X_k)}(\theta) \overset{?}{=} \prod\nolimits_k M_{X_k}(\theta).
\tag{3.3.2}
\]

Unfortunately, this hope shatters when we subject it to interrogation. 

It is not hard to find the reason that (3.3.2) fails. The identity (3.3.1) depends on the fact that 
the scalar exponential converts a sum into a product. In contrast, for Hermitian matrices, 

\[
e^{A + H} \neq e^{A} e^{H} \quad\text{unless $A$ and $H$ commute.}
\]

If we introduce the trace, the situation improves somewhat:

\[
\operatorname{tr} e^{A + H} \leq \operatorname{tr} e^{A} e^{H} \quad\text{for all Hermitian $A$ and $H$.}
\tag{3.3.3}
\]

The result (3.3.3) is known as the Golden-Thompson inequality, a famous theorem from statistical
physics. Unfortunately, the analogous bound may fail for three matrices:

\[
\operatorname{tr} e^{A + H + T} \not\leq \operatorname{tr} e^{A} e^{H} e^{T} \quad\text{for certain Hermitian $A$, $H$, and $T$.}
\]

It seems that we have reached an impasse. 
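
The two-matrix statements above are easy to check numerically. The following sketch is only an illustration: it uses two hypothetical non-commuting real symmetric 2 x 2 matrices, and it evaluates matrix functions through a closed form that is valid only for real symmetric 2 x 2 matrices. The helper names are assumptions made for this example.

```python
import math

def sym2_function(f, M):
    """Apply a scalar function f to a real symmetric 2 x 2 matrix M spectrally.

    Writes M = mid * I + N with N^2 = rad^2 * I, so f(M) = s * I + slope * N.
    """
    a, b, c = M[0][0], M[0][1], M[1][1]
    mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    if rad == 0.0:
        return [[f(mid), 0.0], [0.0, f(mid)]]
    s = (f(mid + rad) + f(mid - rad)) / 2
    slope = (f(mid + rad) - f(mid - rad)) / (2 * rad)
    return [[s + slope * (a - mid), slope * b],
            [slope * b, s + slope * (c - mid)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(M):
    return M[0][0] + M[1][1]

A = [[1.0, 0.0], [0.0, -1.0]]   # hypothetical example matrices that do not commute
H = [[0.0, 1.0], [1.0, 0.0]]
S = [[A[i][j] + H[i][j] for j in range(2)] for i in range(2)]

exp_sum = sym2_function(math.exp, S)                                      # e^{A+H}
product = matmul(sym2_function(math.exp, A), sym2_function(math.exp, H))  # e^A e^H

lhs = trace(exp_sum)    # tr e^{A+H}
rhs = trace(product)    # tr e^A e^H
print(lhs, rhs)
```

For these matrices the product $e^A e^H$ is not even symmetric, so it cannot equal $e^{A+H}$, while the traces satisfy the Golden-Thompson inequality with $2\cosh\sqrt{2}$ on the left and $2\cosh^2 1$ on the right.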

What if we consider the cgf instead? The cgf of a sum of independent real random variables 
satisfies an addition rule: 


\[
\Xi_{(\sum_k X_k)}(\theta)
= \log \mathbb{E} \exp\Big(\sum\nolimits_k \theta X_k\Big)
= \log \prod\nolimits_k \mathbb{E}\, e^{\theta X_k}
= \sum\nolimits_k \Xi_{X_k}(\theta).
\tag{3.3.4}
\]

The relation (3.3.4) follows when we extract the logarithm of the multiplication rule (3.3.1). This
result looks like a more promising candidate for generalization because a sum of Hermitian matrices
remains Hermitian. We might hope that

\[
\Xi_{(\sum_k X_k)}(\theta) \overset{?}{=} \sum\nolimits_k \Xi_{X_k}(\theta).
\]

As stated, this putative identity also fails. Nevertheless, the addition rule (3.3.4) admits a very
satisfactory extension to matrices. In contrast with the scalar case, the proof involves much deeper
considerations.

3.4 A Theorem of Lieb

To find the appropriate generalization of the addition rule for cgfs, we turn to the literature on 
matrix analysis. Here, we discover a famous result of Elliott Lieb on the convexity properties of 
the trace exponential function. 

Theorem 3.4.1 (Lieb). Fix an Hermitian matrix H with dimension d. The function

\[
A \longmapsto \operatorname{tr} \exp(H + \log A)
\]

is a concave map on the convex cone of $d \times d$ positive-definite matrices.

In the scalar case, the analogous function $a \mapsto \exp(h + \log a)$ is linear, so this result describes
a new type of phenomenon that emerges when we move to the matrix setting. We present a
complete proof of Theorem 3.4.1 in Chapter 8.

For now, let us focus on the consequences of this remarkable result. Lieb’s Theorem is valuable
to us because the Laplace transform bounds from Section 3.2 involve the trace exponential
function. To highlight the connection, we rephrase Theorem 3.4.1 in probabilistic terms.

Corollary 3.4.2. Let H be a fixed Hermitian matrix, and let X be a random Hermitian matrix of
the same dimension. Then

\[
\mathbb{E} \operatorname{tr} \exp(H + X) \leq \operatorname{tr} \exp\big(H + \log \mathbb{E}\, e^{X}\big).
\]

Proof. Introduce the random matrix $Y = e^{X}$. Then

\[
\begin{aligned}
\mathbb{E} \operatorname{tr} \exp(H + X) &= \mathbb{E} \operatorname{tr} \exp(H + \log Y) \\
&\leq \operatorname{tr} \exp(H + \log \mathbb{E} Y) = \operatorname{tr} \exp\big(H + \log \mathbb{E}\, e^{X}\big).
\end{aligned}
\]

The first identity follows from the interpretation (2.1.17) of the matrix logarithm as the functional
inverse of the matrix exponential. Theorem 3.4.1 shows that the trace function is concave in Y,
so Jensen’s inequality (2.2.2) allows us to draw the expectation inside the function. □
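
Corollary 3.4.2 can be sanity-checked on a small example. The sketch below assumes a random matrix of the form X = (random sign) * A for a fixed real symmetric 2 x 2 matrix A, so that $\mathbb{E}\, e^{X} = \cosh(A)$. The matrices H and A and the helper names are hypothetical choices for illustration, and the closed-form matrix function applies only to real symmetric 2 x 2 matrices.

```python
import math

def eigs_sym2(M):
    """(min, max) eigenvalues of a real symmetric 2 x 2 matrix."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    return mid - rad, mid + rad

def sym2_function(f, M):
    """Apply a scalar function f to a real symmetric 2 x 2 matrix M spectrally."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    if rad == 0.0:
        return [[f(mid), 0.0], [0.0, f(mid)]]
    s = (f(mid + rad) + f(mid - rad)) / 2
    slope = (f(mid + rad) - f(mid - rad)) / (2 * rad)
    return [[s + slope * (a - mid), slope * b],
            [slope * b, s + slope * (c - mid)]]

def trace_exp(M):
    """tr exp(M) from the eigenvalues of M."""
    lo, hi = eigs_sym2(M)
    return math.exp(lo) + math.exp(hi)

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

H = [[0.0, 1.0], [1.0, 0.0]]    # fixed matrix
A = [[2.0, 0.0], [0.0, 0.0]]    # X = A or X = -A, each with probability 1/2
neg_A = [[-x for x in row] for row in A]

# Left-hand side: E tr exp(H + X), averaging over the two outcomes of X.
lhs = 0.5 * trace_exp(add(H, A)) + 0.5 * trace_exp(add(H, neg_A))

# Right-hand side: tr exp(H + log E e^X), where log E e^X = log cosh(A).
log_mgf = sym2_function(lambda x: math.log(math.cosh(x)), A)
rhs = trace_exp(add(H, log_mgf))
print(lhs, rhs)
```

Here H and A do not commute, and the left-hand side comes out strictly smaller than the right-hand side, as the corollary predicts.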

3.5 Subadditivity of the Matrix Cgf 

We are now prepared to generalize the addition rule (3.3.4) for scalar cgfs to the matrix setting. 
The following result is fundamental to our approach to random matrices. 

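Lemma 3.5.1 (Subadditivity of Matrix Cgfs). Consider a finite sequence $\{X_k\}$ of independent,
random, Hermitian matrices of the same dimension. Then

\[
\mathbb{E} \operatorname{tr} \exp\Big(\sum\nolimits_k \theta X_k\Big)
\leq \operatorname{tr} \exp\Big(\sum\nolimits_k \log \mathbb{E}\, e^{\theta X_k}\Big)
\quad\text{for } \theta \in \mathbb{R}.
\tag{3.5.1}
\]

Equivalently,

\[
\operatorname{tr} \exp\big(\Xi_{(\sum_k X_k)}(\theta)\big)
\leq \operatorname{tr} \exp\Big(\sum\nolimits_k \Xi_{X_k}(\theta)\Big)
\quad\text{for } \theta \in \mathbb{R}.
\tag{3.5.2}
\]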

The parallel between the additivity rule (3.3.4) and the subadditivity rule (3.5.2) is striking.
With our level of preparation, it is easy to prove this result. We just apply the bound from
Corollary 3.4.2 repeatedly.

Proof. Without loss of generality, we assume that $\theta = 1$ by absorbing the parameter into the
random matrices. Let $\mathbb{E}_k$ denote the expectation with respect to $X_k$, the remaining random matrices
held fixed. Abbreviate

\[
\Xi_k = \log \mathbb{E}_k\, e^{X_k} = \log \mathbb{E}\, e^{X_k}.
\]

We may calculate that

\[
\begin{aligned}
\mathbb{E} \operatorname{tr} \exp\Big(\sum\nolimits_{k=1}^{n} X_k\Big)
&= \mathbb{E}\, \mathbb{E}_n \operatorname{tr} \exp\Big(\sum\nolimits_{k=1}^{n-1} X_k + X_n\Big) \\
&\leq \mathbb{E} \operatorname{tr} \exp\Big(\sum\nolimits_{k=1}^{n-1} X_k + \log \mathbb{E}_n\, e^{X_n}\Big) \\
&= \mathbb{E}\, \mathbb{E}_{n-1} \operatorname{tr} \exp\Big(\sum\nolimits_{k=1}^{n-2} X_k + X_{n-1} + \Xi_n\Big) \\
&\leq \mathbb{E}\, \mathbb{E}_{n-2} \operatorname{tr} \exp\Big(\sum\nolimits_{k=1}^{n-2} X_k + \Xi_{n-1} + \Xi_n\Big) \\
&\;\;\vdots \\
&\leq \operatorname{tr} \exp\Big(\sum\nolimits_{k=1}^{n} \Xi_k\Big).
\end{aligned}
\]

We can introduce iterated expectations because of the tower property of conditional expectation.
To bound the expectation $\mathbb{E}_m$ for an index m = 1, 2, 3, ..., n, we invoke Corollary 3.4.2 with the
fixed matrix H equal to

\[
H_m = \sum_{k=1}^{m-1} X_k + \sum_{k=m+1}^{n} \Xi_k.
\]

This argument is legitimate because $H_m$ is independent from $X_m$.

The formulation (3.5.2) follows from (3.5.1) when we substitute the expression (3.1.1) for the
matrix cgf and make some algebraic simplifications. □

3.6 Master Bounds for Sums of Independent Random Matrices 

Finally, we can present some general results on the behavior of a sum of independent random 
matrices. At this stage, we simply combine the Laplace transform bounds with the subadditivity 
of the matrix cgf to obtain abstract inequalities. Later, we will harness properties of the summands
to develop more concrete estimates that apply to specific examples of interest.

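Theorem 3.6.1 (Master Bounds for a Sum of Independent Random Matrices). Consider a finite
sequence $\{X_k\}$ of independent, random, Hermitian matrices of the same size. Then

\[
\mathbb{E}\, \lambda_{\max}\Big(\sum\nolimits_k X_k\Big)
\leq \inf_{\theta > 0}\ \frac{1}{\theta} \log \operatorname{tr} \exp\Big(\sum\nolimits_k \log \mathbb{E}\, e^{\theta X_k}\Big), \quad\text{and}
\tag{3.6.1}
\]

\[
\mathbb{E}\, \lambda_{\min}\Big(\sum\nolimits_k X_k\Big)
\geq \sup_{\theta < 0}\ \frac{1}{\theta} \log \operatorname{tr} \exp\Big(\sum\nolimits_k \log \mathbb{E}\, e^{\theta X_k}\Big).
\tag{3.6.2}
\]

Furthermore, for all $t \in \mathbb{R}$,

\[
\mathbb{P}\Big\{ \lambda_{\max}\Big(\sum\nolimits_k X_k\Big) \geq t \Big\}
\leq \inf_{\theta > 0}\ e^{-\theta t} \operatorname{tr} \exp\Big(\sum\nolimits_k \log \mathbb{E}\, e^{\theta X_k}\Big), \quad\text{and}
\tag{3.6.3}
\]

\[
\mathbb{P}\Big\{ \lambda_{\min}\Big(\sum\nolimits_k X_k\Big) \leq t \Big\}
\leq \inf_{\theta < 0}\ e^{-\theta t} \operatorname{tr} \exp\Big(\sum\nolimits_k \log \mathbb{E}\, e^{\theta X_k}\Big).
\tag{3.6.4}
\]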

Proof. Substitute the subadditivity rule for matrix cgfs, Lemma 3.5.1, into the two matrix Laplace
transform results, Proposition 3.2.1 and Proposition 3.2.2. □

In this chapter, we have focused on probability inequalities for the extreme eigenvalues of a
sum of independent random matrices. Nevertheless, these results also give information about
the spectral norm of a sum of independent, random, rectangular matrices because we can apply
them to the Hermitian dilation (2.1.26) of the sum. Instead of presenting a general theorem, we
find it more natural to extend individual results to the non-Hermitian case.
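
To see the master tail bound (3.6.3) of Theorem 3.6.1 in action, the sketch below evaluates it for a sum of fixed matrices modulated by independent random signs and compares it with the exact tail probability obtained by enumerating all sign patterns. This is an illustration under stated assumptions: the three real symmetric 2 x 2 coefficient matrices are hypothetical, the identity $\log \mathbb{E}\, e^{\theta \varrho A} = \log\cosh(\theta A)$ for a random sign $\varrho$ is used for each summand, and the infimum over $\theta$ is replaced by a minimum over a grid.

```python
import itertools
import math

def eigs_sym2(M):
    """(min, max) eigenvalues of a real symmetric 2 x 2 matrix."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    return mid - rad, mid + rad

def sym2_function(f, M):
    """Apply a scalar function f to a real symmetric 2 x 2 matrix M spectrally."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    if rad == 0.0:
        return [[f(mid), 0.0], [0.0, f(mid)]]
    s = (f(mid + rad) + f(mid - rad)) / 2
    slope = (f(mid + rad) - f(mid - rad)) / (2 * rad)
    return [[s + slope * (a - mid), slope * b],
            [slope * b, s + slope * (c - mid)]]

def log_cosh(x):
    return math.log(math.cosh(x))

# Hypothetical coefficient matrices; the summands are X_k = sign_k * coeffs[k].
coeffs = [[[1.0, 0.0], [0.0, -1.0]],
          [[0.0, 1.0], [1.0, 0.0]],
          [[1.0, 0.0], [0.0, 1.0]]]

def exact_tail(t):
    """Exact P{lambda_max(sum_k X_k) >= t} by enumerating all sign patterns."""
    patterns = list(itertools.product([-1.0, 1.0], repeat=len(coeffs)))
    hits = 0
    for signs in patterns:
        Y = [[sum(e * A[i][j] for e, A in zip(signs, coeffs)) for j in range(2)]
             for i in range(2)]
        if eigs_sym2(Y)[1] >= t:
            hits += 1
    return hits / len(patterns)

def master_bound(t, thetas):
    """Grid minimum of exp(-theta t) * tr exp(sum_k log E exp(theta X_k))."""
    best = float("inf")
    for th in thetas:
        cgfs = [sym2_function(log_cosh, [[th * x for x in row] for row in A])
                for A in coeffs]
        total = [[sum(C[i][j] for C in cgfs) for j in range(2)] for i in range(2)]
        lo, hi = eigs_sym2(total)
        best = min(best, math.exp(-th * t) * (math.exp(lo) + math.exp(hi)))
    return best

thetas = [0.01 * k for k in range(1, 501)]
t = 2.4
exact = exact_tail(t)
bound = master_bound(t, thetas)
print(exact, bound)
```

In this example the maximum eigenvalue equals $\sqrt{2} \pm 1$ with equal probability, so the exact tail probability at t = 2.4 is one half, and the bound is a valid but somewhat larger number below one. In such a low dimension the bound is not expected to be tight.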


3.7 Notes 

This section includes some historical discussion about the results we have described in this 
chapter, along with citations for the results that we have established. 

3.7.1 The Matrix Laplace Transform Method 

The idea of lifting the “Bernstein trick” to the matrix setting is due to two researchers in quantum
information theory, Rudolf Ahlswede and Andreas Winter, who were working on a problem
concerning transmission of information through a quantum channel [AW02]. Their paper contains
a version of the matrix Laplace transform result, Proposition 3.2.1, along with a substantial
number of related foundational ideas. Their work is one of the major inspirations for the tools
that are described in these notes.

The statement of Proposition 3.2.1 and the proof that we present appear in the paper [Oli10b]
of Roberto Oliveira. The subsequent result on expectations, Proposition 3.2.2, first appeared in
the paper [CGT12a].

3.7.2 Subadditivity of Cumulants 

The major impediment to applying the matrix Laplace transform method is the need to produce 
a bound for the trace of the matrix moment generating function (the trace mgf). This is where 
all the technical difficulty in the argument resides. 

Ahlswede & Winter [AW02, App.] proposed an approach for bounding the trace mgf of an
independent sum, based on a repeated application of the Golden-Thompson inequality (3.3.3).
Their argument leads to a cumulant bound of the form

\[
\mathbb{E} \operatorname{tr} \exp\Big(\sum\nolimits_k X_k\Big)
\leq d \cdot \exp\Big(\sum\nolimits_k \lambda_{\max}\big(\log \mathbb{E}\, e^{X_k}\big)\Big)
\tag{3.7.1}
\]

when the random Hermitian matrices $X_k$ have dimension d. In other words, Ahlswede & Winter
bound the cumulant of a sum in terms of the sum of the maximum eigenvalues of the cumulants.
There are cases where the bound (3.7.1) is equivalent to Lemma 3.5.1. For example,
the estimates coincide when the matrices $X_k$ are identically distributed. In general, however, the
estimate (3.7.1) leads to fundamentally weaker results than our bound from Lemma 3.5.1. In
the worst case, the approach of Ahlswede & Winter may produce an unnecessary factor of the
dimension d in the exponent. See [Tro11c, §§3.7, 4.8] for details.

The first major technical advance beyond the original argument of Ahlswede & Winter appeared
in a paper [Oli10a] of Oliveira. He developed a more effective way to deploy the Golden-Thompson
inequality, and he used this technique to establish a matrix version of Freedman’s
inequality [Fre75]. In the scalar setting, Freedman’s inequality extends the Bernstein concentration
inequality to martingales; Oliveira obtained the analogous extension of Bernstein’s inequality
for matrix-valued martingales. When specialized to independent sums, his result is quite
similar to the matrix Bernstein inequality, Theorem 1.6.2, apart from the precise values of the
constants. Oliveira’s method, however, does not seem to deliver the full spectrum of matrix
concentration inequalities that we discuss in these notes.

The approach here, based on Lieb’s Theorem, was introduced in the article [Tro11c] by the
author of these notes. This paper was apparently the first to recognize that Lieb’s Theorem has
probabilistic content, as stated in Corollary 3.4.2. This idea leads to Lemma 3.5.1, on the
subadditivity of cumulants, along with the master tail bounds from Theorem 3.6.1. Note that the two
articles [Oli10a, Tro11c] are independent works.

For a detailed discussion of the benefits of Lieb’s Theorem over the Golden-Thompson
inequality, see [Tro11c, §4]. In summary, to get the sharpest concentration results for random
matrices, Lieb’s Theorem appears to be indispensable. The approach of Ahlswede & Winter seems
intrinsically weaker. Oliveira’s argument has certain advantages, however, in that it extends from
matrices to the fully noncommutative setting [JZ12].

Subsequent research on the underpinnings of the matrix Laplace transform method has led
to a martingale version of the subadditivity of cumulants [Tro11a, Tro11b]; these works also
depend on Lieb’s Theorem. The technical report [GT14] shows how to use a related result, called
the Lieb-Seiringer Theorem [LS05], to obtain upper and lower tail bounds for all eigenvalues of
a sum of independent random Hermitian matrices.


3.7.3 Noncommutative Moment Inequalities 

There is a closely related, and much older, line of research on noncommutative moment
inequalities. These results provide information about the expected trace of a power of a sum of
independent random matrices. The matrix Laplace transform method, as encapsulated in
Theorem 3.6.1, gives analogous bounds for the exponential moments.

Research on noncommutative moment inequalities dates to an important paper [LP86] of
Françoise Lust-Piquard, which contains an operator extension of the Khintchine inequality. Her
result, now called the noncommutative Khintchine inequality, controls the trace moments of a
sum of fixed matrices, each modulated by an independent Rademacher random variable; see
Section 4.7.2 for more details.

In recent years, researchers have generalized many other moment inequalities for a sum of
scalar random variables to matrices (and beyond). For instance, the Rosenthal-Pinelis inequality
for a sum of independent zero-mean random variables admits a matrix version [JZ13, MJC+14,
CGT12a]. We present a variant of the latter result below in (6.1.6). See the paper [JX05] for a good
overview of some other noncommutative moment inequalities.

Finally, and tangentially, we mention that a different notion of matrix moments and cumulants
plays a central role in the theory of free probability [NS06].


3.7.4 Quantum Statistical Mechanics 

A curious feature of the theory of matrix concentration inequalities is that the most powerful
tools come from the mathematical theory of quantum statistical mechanics. This field studies
the bulk statistical properties of interacting quantum systems, and it would seem quite distant
from the field of random matrix theory. The connection between these two areas has emerged
because of research on quantum information theory, which studies how information can be
encoded, operated upon, and transmitted via quantum mechanical systems.

The Golden-Thompson inequality is a major result from quantum statistical mechanics.
Bhatia’s book [Bha97, Sec. IX.3] contains a detailed treatment of this result from the perspective of
matrix theory. For an account with more physical content, see the book of Thirring [Thi02]. The
fact that the Golden-Thompson inequality fails for three matrices can be obtained from simple
examples, such as combinations of Pauli spin matrices [Bha97, Exer. IX.8.4].

Lieb’s Theorem [Lie73, Thm. 6] was first established in an important paper of Elliott Lieb on
the convexity of trace functions. His main goal was to establish concavity properties for a function
that measures the amount of information in a quantum system. See the notes in Chapter 8
for a more detailed discussion.



CHAPTER 4

Matrix Gaussian Series & Matrix Rademacher Series


In this chapter, we present our first set of matrix concentration inequalities. These results provide
spectral information about a sum of fixed matrices, each modulated by an independent
scalar random variable. This type of formulation is surprisingly versatile, and it captures a range
of interesting examples. Our main goal, however, is to introduce matrix concentration in the
simplest setting possible.

To be more precise about our scope, let us introduce the concept of a matrix Gaussian series.
Consider a finite sequence $\{B_k\}$ of fixed matrices with the same dimension, along with a finite
sequence $\{\gamma_k\}$ of independent standard normal random variables. We will study the spectral
norm of the random matrix

\[
Z = \sum\nolimits_k \gamma_k B_k.
\]

This expression looks abstract, but it has concrete modeling power. For example, we can express
a Gaussian Wigner matrix, one of the classical random matrices, in this fashion. But the real value
of this approach is that we can use matrix Gaussian series to represent many kinds of random
matrices built from Gaussian random variables. This technique allows us to attack problems that
classical methods do not handle gracefully. For instance, we can easily study a Toeplitz matrix
with Gaussian entries.

Similar ideas allow us to treat a matrix Rademacher series, a sum of fixed matrices modulated
by random signs. (Recall that a Rademacher random variable takes the values ±1 with equal
probability.) The results in this case are almost identical to the results for matrix Gaussian
series, but they allow us to consider new problems. As an example, we can study the expected
spectral norm of a fixed real matrix after flipping the signs of the entries at random.


Overview 

In §4.1, we begin with an overview of our results for matrix Gaussian series; very similar results
also hold for matrix Rademacher series. Afterward, we discuss the accuracy of the theoretical
bounds. The subsequent sections, §§4.2-4.4, describe what the matrix concentration inequalities
tell us about some classical and not-so-classical examples of random matrices. Section 4.5
includes an overview of a more substantial application in combinatorial optimization. The final
part §4.6 contains detailed proofs of the bounds. We conclude with bibliographical notes.

4.1 A Norm Bound for Random Series with Matrix Coefficients 

Consider a finite sequence $\{b_k\}$ of real numbers and a finite sequence $\{\gamma_k\}$ of independent
standard normal random variables. Form the random series $Z = \sum_k \gamma_k b_k$. A routine invocation of
the scalar Laplace transform method demonstrates that

$$\mathbb{P}\{|Z| \ge t\} \le 2 \exp\left(\frac{-t^2}{2v}\right) \quad\text{where}\quad v = \operatorname{Var}(Z) = \sum_k b_k^2. \tag{4.1.1}$$

It turns out that the inequality (4.1.1) extends directly to the matrix setting. 

Theorem 4.1.1 (Matrix Gaussian & Rademacher Series). Consider a finite sequence $\{B_k\}$ of fixed
complex matrices with dimension $d_1 \times d_2$, and let $\{\gamma_k\}$ be a finite sequence of independent standard
normal variables. Introduce the matrix Gaussian series

$$Z = \sum_k \gamma_k B_k. \tag{4.1.2}$$

Let $v(Z)$ be the matrix variance statistic of the sum:

$$v(Z) = \max\{ \|\mathbb{E}(ZZ^*)\|, \ \|\mathbb{E}(Z^*Z)\| \} \tag{4.1.3}$$

$$\phantom{v(Z)} = \max\Big\{ \Big\|\sum_k B_k B_k^*\Big\|, \ \Big\|\sum_k B_k^* B_k\Big\| \Big\}. \tag{4.1.4}$$

Then

$$\mathbb{E}\|Z\| \le \sqrt{2 v(Z) \log(d_1 + d_2)}. \tag{4.1.5}$$

Furthermore, for all $t \ge 0$,

$$\mathbb{P}\{\|Z\| \ge t\} \le (d_1 + d_2) \exp\left(\frac{-t^2}{2v(Z)}\right). \tag{4.1.6}$$

The same bounds hold when we replace $\{\gamma_k\}$ by a finite sequence $\{\varrho_k\}$ of independent Rademacher
random variables.

The proof of Theorem 4.1.1 appears below in §4.6. 

4.1.1 Discussion 

Let us take a moment to discuss the content of Theorem 4.1.1. The main message is that the
expectation of $\|Z\|$ is controlled by the matrix variance statistic $v(Z)$. Furthermore, $\|Z\|$ has a
subgaussian tail whose decay rate depends on $v(Z)$.

The matrix variance statistic $v(Z)$ defined in (4.1.3) specializes the general formulation (2.2.8).
The second expression (4.1.4) follows from the additivity property (2.2.11) for the variance of an
independent sum. When the summands are Hermitian, observe that the two terms in the maximum
coincide. The formulas (4.1.3) and (4.1.4) are a direct extension of the variance that arises
in the scalar bound (4.1.1).

Figure 4.1: Schematic of tail bound for matrix Gaussian series. Consider a matrix Gaussian
series $Z$ with dimension $d_1 \times d_2$. The tail probability $\mathbb{P}\{\|Z\| \ge t\}$ admits the upper bound
$(d_1 + d_2) \exp(-t^2 / (2v(Z)))$, marked as a dark blue curve. This estimate provides no information
below the level $t = \sqrt{2v(Z) \log(d_1 + d_2)}$. This value, the dark red vertical line, coincides with the
upper bound (4.1.5) for $\mathbb{E}\|Z\|$. As $t$ increases beyond this point, the tail probability decreases at
a subgaussian rate with variance on the order of $v(Z)$.


As compared with (4.1.1), a new feature of the bound (4.1.6) is the dimensional factor $d_1 + d_2$.
When $d_1 = d_2 = 1$, the matrix bound reduces to the scalar result (4.1.1). In this case, at least,
we have lost nothing by lifting the Laplace transform method to matrices. The behavior of the
matrix tail bound (4.1.6) is more subtle than the behavior of the scalar tail bound (4.1.1). See
Figure 4.1 for an illustration.
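To make the quantities in Theorem 4.1.1 concrete, here is a minimal numerical sketch, assuming NumPy is available. The function names and the random test coefficients are illustrative choices rather than part of the theory, and capping the tail estimate at one is a convenience added here because the quantity bounds a probability.

```python
import numpy as np

def matrix_variance_statistic(coeffs):
    """v(Z) from (4.1.4) for Z = sum_k gamma_k B_k with fixed coefficients B_k."""
    row_sum = sum(B @ B.conj().T for B in coeffs)   # sum_k B_k B_k^*
    col_sum = sum(B.conj().T @ B for B in coeffs)   # sum_k B_k^* B_k
    return max(np.linalg.norm(row_sum, 2), np.linalg.norm(col_sum, 2))

def expectation_bound(v, d1, d2):
    """Right-hand side of (4.1.5)."""
    return np.sqrt(2.0 * v * np.log(d1 + d2))

def tail_bound(v, d1, d2, t):
    """Right-hand side of (4.1.6), capped at 1 (an added convenience)."""
    return min(1.0, (d1 + d2) * np.exp(-t**2 / (2.0 * v)))

# Hypothetical example: three fixed 2 x 3 coefficient matrices.
rng = np.random.default_rng(0)
coeffs = [rng.standard_normal((2, 3)) for _ in range(3)]
v = matrix_variance_statistic(coeffs)
level = expectation_bound(v, 2, 3)
print("v(Z) =", v)
print("expectation bound =", level)
print("tail bound at that level =", tail_bound(v, 2, 3, level))
```

At the level returned by `expectation_bound`, the uncapped tail estimate equals one, which is the crossover point described in the caption of Figure 4.1.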

4.1.2 Optimality of the Bounds for Matrix Gaussian Series 

One may wonder whether Theorem 4.1.1 provides accurate information about the behavior of a 
matrix Gaussian series. The answer turns out to be complicated. Here is the executive summary: 
the expectation bound (4.1.5) is always quite good, but the tail bound (4.1.6) is sometimes quite 
bad. The rest of this section expands on these claims. 

The Expectation Bound 

Let $Z$ be a matrix Gaussian series of the form (4.1.2). We will argue that

$$v(Z) \le \mathbb{E}\|Z\|^2 \le 2v(Z)\,(1 + \log(d_1 + d_2)). \tag{4.1.7}$$

In other words, the matrix variance $v(Z)$ is roughly the correct scale for $\|Z\|^2$. This pair of
estimates is a significant achievement because it is quite challenging to compute the norm of a
matrix Gaussian series in general. Indeed, the literature contains very few examples where
explicit estimates are available, especially if one desires reasonable constants.

We begin with the lower bound in (4.1.7), which is elementary. Indeed, since the spectral
norm is convex, Jensen's inequality ensures that

$$\mathbb{E}\|Z\|^2 = \mathbb{E} \max\{ \|ZZ^*\|, \ \|Z^*Z\| \} \ge \max\{ \|\mathbb{E}(ZZ^*)\|, \ \|\mathbb{E}(Z^*Z)\| \} = v(Z).$$

The first identity follows from (2.1.24), and the last is the definition (2.2.8) of the matrix variance. 
The upper bound in (4.1.7) is a consequence of the tail bound (4.1.6): 

$$\mathbb{E}\|Z\|^2 = \int_0^\infty 2t \, \mathbb{P}\{\|Z\| > t\} \, \mathrm{d}t
\le \int_0^E 2t \, \mathrm{d}t + 2(d_1 + d_2) \int_E^\infty t \, e^{-t^2/(2v(Z))} \, \mathrm{d}t
= E^2 + 2v(Z)\,(d_1 + d_2)\, e^{-E^2/(2v(Z))}.$$

In the first step, rewrite the expectation using integration by parts, and then split the integral at
a positive number $E$. In the first term, we bound the probability by one, while the second term
results from the tail bound (4.1.6). Afterward, we compute the integrals explicitly. Finally, select
$E^2 = 2v(Z)\log(d_1 + d_2)$ to complete the proof of (4.1.7).
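For readers who want the integrals spelled out, the following worked computation fills in the last two steps. It is a sketch that uses only the substitution $s = t^2/(2v(Z))$ (an auxiliary variable introduced here) and the stated choice of $E$.

```latex
\begin{align*}
\int_0^E 2t \,\mathrm{d}t &= E^2, \\
2(d_1+d_2)\int_E^\infty t\, e^{-t^2/(2v(Z))}\,\mathrm{d}t
  &= 2(d_1+d_2)\, v(Z) \int_{E^2/(2v(Z))}^\infty e^{-s}\,\mathrm{d}s
   = 2 v(Z)\,(d_1+d_2)\, e^{-E^2/(2v(Z))}.
\end{align*}
% With E^2 = 2 v(Z) log(d_1 + d_2), the exponential equals (d_1 + d_2)^{-1}, so
\[
\mathbb{E}\|Z\|^2 \le 2v(Z)\log(d_1+d_2) + 2v(Z)
  = 2v(Z)\bigl(1+\log(d_1+d_2)\bigr).
\]
```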

About the Dimensional Factor 

At this point, one may ask whether it is possible to improve either side of the inequality (4.1.7).
The answer is negative unless we have additional information about the Gaussian series beyond
the matrix variance statistic $v(Z)$.

Indeed, for arbitrarily large dimensions $d_1$ and $d_2$, we can exhibit a matrix Gaussian series
where the left-hand inequality in (4.1.7) is correct. That is, $\mathbb{E}\|Z\|^2 \approx v(Z)$ with no additional
dependence on the dimensions $d_1$ or $d_2$. One such example appears below in §4.2.2.

At the same time, for arbitrarily large dimensions $d_1$ and $d_2$, we can construct a matrix Gaussian
series where the right-hand inequality in (4.1.7) is correct. That is, $\mathbb{E}\|Z\|^2 \approx v(Z)\log(d_1 + d_2)$.
See §4.4 for an example.

We can offer a rough intuition about how these two situations differ from each other. The
presence or absence of the dimensional factor $\log(d_1 + d_2)$ depends on how much the coefficients
$B_k$ in the matrix Gaussian series $Z$ commute with each other. More commutativity leads
to a logarithm, while less commutativity can sometimes result in cancelations that obliterate the
logarithm. It remains a major open question to find a simple quantity, computable from the
coefficients $B_k$, that decides whether $\mathbb{E}\|Z\|^2$ contains a dimensional factor or not.

In Chapter 7, we will describe a technique that allows us to moderate the dimensional factor 
in (4.1.7) for some types of matrix series. But we cannot remove the dimensional factor entirely 
with current technology. 

The Tail Bound 

What about the tail bound (4.1.6) for the norm of the Gaussian series? Here, our results are less
impressive. It turns out that the large-deviation behavior of the spectral norm of a matrix Gaussian
series $Z$ is controlled by a statistic $v_\star(Z)$ called the weak variance:

$$v_\star(Z) = \sup_{\|u\| = \|w\| = 1} \mathbb{E}\,|u^* Z w|^2 = \sup_{\|u\| = \|w\| = 1} \sum_k |u^* B_k w|^2.$$

The best general inequalities between the matrix variance statistic and the weak variance are

$$v_\star(Z) \le v(Z) \le \min\{d_1, d_2\} \cdot v_\star(Z).$$

There are examples of matrix Gaussian series that saturate the lower or the upper inequality.

The classical concentration inequality [BLM13, Thm. 5.6] for a function of independent Gaussian
random variables implies that

$$\mathbb{P}\{\|Z\| \ge \mathbb{E}\|Z\| + t\} \le e^{-t^2/(2v_\star(Z))}. \tag{4.1.8}$$

Let us emphasize that the bound (4.1.8) provides no information about E||Z||; it only tells us 
about the probability that ||Z|| is larger than its mean. 

Together, the last two displays indicate that the exponent in the tail bound (4.1.6) is sometimes
too big by a factor $\min\{d_1, d_2\}$. Therefore, a direct application of Theorem 4.1.1 can badly
overestimate the tail probability $\mathbb{P}\{\|Z\| > t\}$ when the level $t$ is large. Fortunately, this problem
is less pronounced with the matrix Chernoff inequalities of Chapter 5 and the matrix Bernstein
inequalities of Chapter 6.


Expectations and Tails 

When studying concentration of random variables, it is quite common that we need to use one
method to assess the expected value of the random variable and a separate technique to determine
the probability of a large deviation.

The primary value of matrix concentration inequalities inheres in the estimates
that they provide for the expectation of the spectral norm (or maximum eigenvalue
or minimum eigenvalue) of a random matrix.

In many cases, matrix concentration bounds provide reasonable information about the tail decay,
but there are other situations where the tail bounds are feeble. In this event, we recommend
applying a scalar concentration inequality to control the tails.

4.2 Example: Some Gaussian Matrices 

Let us try out our methods on two types of Gaussian matrices that have been studied extensively
in the classical literature on random matrix theory. In these cases, precise information about the
spectral distribution is available, which provides a benchmark for assessing our results. We find
that bounds based on Theorem 4.1.1 lead to very reasonable estimates, but they are not sharp.
The advantage of our approach is that it applies to every example, whereas we are making
comparisons with specialized techniques that only illuminate individual cases. Similar conclusions
hold for matrices with independent Rademacher entries.

4.2.1 Gaussian Wigner Matrices 

We begin with a family of Gaussian Wigner matrices. A $d \times d$ matrix $W_d$ from this ensemble
is real-symmetric with a zero diagonal; the entries above the diagonal are independent normal
variables with mean zero and variance one:

$$W_d = \begin{bmatrix}
0 & \gamma_{12} & \gamma_{13} & \dots & \gamma_{1d} \\
\gamma_{12} & 0 & \gamma_{23} & \dots & \gamma_{2d} \\
\gamma_{13} & \gamma_{23} & 0 & & \gamma_{3d} \\
\vdots & \vdots & & \ddots & \vdots \\
\gamma_{1d} & \gamma_{2d} & \dots & \gamma_{d-1,d} & 0
\end{bmatrix}$$

where $\{\gamma_{jk} : 1 \le j < k \le d\}$ is an independent family of standard normal variables. We can
represent this matrix compactly as a Gaussian series:

$$W_d = \sum_{1 \le j < k \le d} \gamma_{jk} (E_{jk} + E_{kj}). \tag{4.2.1}$$

The norm of a Wigner matrix satisfies

$$\frac{1}{\sqrt{d}} \|W_d\| \longrightarrow 2 \quad\text{as } d \to \infty, \text{ almost surely.} \tag{4.2.2}$$

For example, see [BS10, Thm. 5.1]. To make (4.2.2) precise, we assume that $\{W_d\}$ is an independent
sequence of Gaussian Wigner matrices, indexed by the dimension $d$.

Theorem 4.1.1 provides a simple way to bound the norm of a Gaussian Wigner matrix. We
just need to compute the matrix variance statistic $v(W_d)$. The formula (4.1.4) for $v(W_d)$ asks us
to form the sum of the squared coefficients from the representation (4.2.1):

$$\sum_{1 \le j < k \le d} (E_{jk} + E_{kj})^2 = \sum_{1 \le j < k \le d} (E_{jj} + E_{kk}) = (d - 1)\, I_d.$$

Since the terms in (4.2.1) are Hermitian, we have only one sum of squares to consider. We have
also used the facts that $E_{jk} E_{kj} = E_{jj}$ while $E_{jk} E_{jk} = 0$ because of the condition $j < k$ in the limits
of summation. We see that

$$v(W_d) = \Big\| \sum_{1 \le j < k \le d} (E_{jk} + E_{kj})^2 \Big\| = \|(d - 1)\, I_d\| = d - 1.$$

The bound (4.1.5) for the expectation of the norm gives

$$\mathbb{E}\|W_d\| \le \sqrt{2(d - 1)\log(2d)}. \tag{4.2.3}$$

In conclusion, our techniques overestimate $\|W_d\|$ by a factor of about $\sqrt{0.5 \log d}$. The result (4.2.3)
is not perfect, but it only takes two lines of work. In contrast, the classical result (4.2.2) depends
on a long moment calculation that involves challenging combinatorial arguments.
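The comparison between the classical limit (4.2.2) and the bound (4.2.3) is easy to examine numerically. The following sketch, assuming NumPy, forms the sum of squared coefficients from (4.2.1) explicitly for a small dimension and then estimates the mean norm by Monte Carlo. The dimension, seed, and number of trials are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_wigner(d, rng):
    """Real symmetric d x d matrix, zero diagonal, N(0,1) entries above the diagonal."""
    upper = np.triu(rng.standard_normal((d, d)), k=1)
    return upper + upper.T

def wigner_variance_statistic(d):
    """Norm of sum_{j<k} (E_jk + E_kj)^2, formed term by term from (4.2.1)."""
    total = np.zeros((d, d))
    for j in range(d):
        for k in range(j + 1, d):
            coeff = np.zeros((d, d))
            coeff[j, k] = coeff[k, j] = 1.0
            total += coeff @ coeff
    return np.linalg.norm(total, 2)

rng = np.random.default_rng(0)
d = 100
norms = [np.linalg.norm(gaussian_wigner(d, rng), 2) for _ in range(20)]
mean_norm = float(np.mean(norms))
bound = np.sqrt(2 * (d - 1) * np.log(2 * d))

print("v(W_10) computed from the series:", wigner_variance_statistic(10))
print("Monte Carlo mean of ||W_d||     :", mean_norm)
print("classical scale 2 sqrt(d)       :", 2 * np.sqrt(d))
print("bound (4.2.3)                   :", bound)
```

The explicit double loop is deliberately naive so that it mirrors the series representation; it is only meant for small dimensions.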


4.2.2 Rectangular Gaussian Matrices 

Next, we consider a $d_1 \times d_2$ rectangular matrix with independent standard normal entries:

$$G = \begin{bmatrix}
\gamma_{11} & \gamma_{12} & \gamma_{13} & \dots & \gamma_{1 d_2} \\
\gamma_{21} & \gamma_{22} & \gamma_{23} & \dots & \gamma_{2 d_2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_{d_1 1} & \gamma_{d_1 2} & \gamma_{d_1 3} & \dots & \gamma_{d_1 d_2}
\end{bmatrix}$$

where $\{\gamma_{jk}\}$ is an independent family of standard normal variables. We can express this matrix
efficiently using a Gaussian series:

$$G = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} \gamma_{jk} E_{jk}. \tag{4.2.4}$$

There is an elegant estimate [DS02, Thm. 2.13] for the norm of this matrix:

$$\mathbb{E}\|G\| \le \sqrt{d_1} + \sqrt{d_2}. \tag{4.2.5}$$

The inequality (4.2.5) is sharp when $d_1$ and $d_2$ tend to infinity while the ratio $d_1/d_2 \to \text{const}$.
See [BS10, Thm. 5.8] for details.

Theorem 4.1.1 yields another bound on the expected norm of the matrix $G$. In order to compute
the matrix variance statistic $v(G)$, we calculate the sums of the squared coefficients from
the representation (4.2.4):

$$\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} E_{jk} E_{jk}^* = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} E_{jj} = d_2\, I_{d_1}, \quad\text{and}$$

$$\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} E_{jk}^* E_{jk} = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} E_{kk} = d_1\, I_{d_2}.$$

The matrix variance statistic (4.1.3) satisfies

$$v(G) = \max\{ \|d_2\, I_{d_1}\|, \ \|d_1\, I_{d_2}\| \} = \max\{d_1, d_2\}.$$

We conclude that

$$\mathbb{E}\|G\| \le \sqrt{2 \max\{d_1, d_2\} \log(d_1 + d_2)}. \tag{4.2.6}$$

The leading term is roughly correct because

$$\sqrt{d_1} + \sqrt{d_2} \le 2\sqrt{\max\{d_1, d_2\}} \le 2\left(\sqrt{d_1} + \sqrt{d_2}\right).$$

The logarithmic factor in (4.2.6) does not belong, but it is rather small in comparison with the
leading terms. Once again, we have produced a reasonable result with a short argument based
on general principles.
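As a quick sanity check on the calculation above, the sketch below (assuming NumPy) builds both sums of squared coefficients from (4.2.4) for a small case and compares a Monte Carlo estimate of $\mathbb{E}\|G\|$ with the two bounds (4.2.5) and (4.2.6). The dimensions, seed, and trial count are illustrative assumptions.

```python
import numpy as np

def rectangular_variance_statistic(d1, d2):
    """v(G) from (4.1.4), formed explicitly from the coefficients E_jk in (4.2.4)."""
    row_sum = np.zeros((d1, d1))
    col_sum = np.zeros((d2, d2))
    for j in range(d1):
        for k in range(d2):
            E = np.zeros((d1, d2))
            E[j, k] = 1.0
            row_sum += E @ E.T      # E_jk E_jk^* = E_jj
            col_sum += E.T @ E      # E_jk^* E_jk = E_kk
    return max(np.linalg.norm(row_sum, 2), np.linalg.norm(col_sum, 2))

rng = np.random.default_rng(1)
d1, d2, trials = 40, 160, 50
mean_norm = float(np.mean([np.linalg.norm(rng.standard_normal((d1, d2)), 2)
                           for _ in range(trials)]))
classical = np.sqrt(d1) + np.sqrt(d2)                       # (4.2.5)
series_bound = np.sqrt(2 * max(d1, d2) * np.log(d1 + d2))   # (4.2.6)

print("v(G) for a 3 x 5 matrix :", rectangular_variance_statistic(3, 5))
print("Monte Carlo mean ||G||  :", mean_norm)
print("sqrt(d1) + sqrt(d2)     :", classical)
print("series bound (4.2.6)    :", series_bound)
```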

4.3 Example: Matrices with Randomly Signed Entries 

Next, we turn to an example that is superficially similar to the matrix discussed in §4.2.2 but is
less understood. Consider a fixed $d_1 \times d_2$ matrix $B$ with real entries, and let $\{\varrho_{jk}\}$ be an independent
family of Rademacher random variables. Consider the $d_1 \times d_2$ random matrix

$$B_\pm = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} \varrho_{jk}\, b_{jk}\, E_{jk}.$$

In other words, we obtain the random matrix $B_\pm$ by randomly flipping the sign of each entry of
$B$. The expected norm of this matrix satisfies the bound

$$\mathbb{E}\|B_\pm\| \le \mathrm{Const} \cdot v^{1/2} \cdot \log^{1/4} \min\{d_1, d_2\}, \tag{4.3.1}$$

where the leading factor $v^{1/2}$ satisfies

$$v = \max\left\{ \max\nolimits_j \|b_{j:}\|^2, \ \max\nolimits_k \|b_{:k}\|^2 \right\}. \tag{4.3.2}$$

We have written $b_{j:}$ for the $j$th row of $B$ and $b_{:k}$ for the $k$th column of $B$. In other words, the
expected norm of a matrix with randomly signed entries is comparable with the maximum $\ell_2$
norm achieved by any row or column. There are cases where the bound (4.3.1) admits a matching
lower bound. These results appear in [Seg00, Thms. 3.1, 3.2] and [BV14, Cor. 4.7].

Theorem 4.1.1 leads to a quick proof of a slightly weaker result. We simply need to compute
the matrix variance statistic $v(B_\pm)$. To that end, note that

$$\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} (b_{jk} E_{jk})(b_{jk} E_{jk})^* = \sum_{j=1}^{d_1} \Big( \sum_{k=1}^{d_2} |b_{jk}|^2 \Big) E_{jj}
= \begin{bmatrix} \|b_{1:}\|^2 & & \\ & \ddots & \\ & & \|b_{d_1:}\|^2 \end{bmatrix}.$$

Similarly,

$$\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} (b_{jk} E_{jk})^* (b_{jk} E_{jk}) = \sum_{k=1}^{d_2} \Big( \sum_{j=1}^{d_1} |b_{jk}|^2 \Big) E_{kk}
= \begin{bmatrix} \|b_{:1}\|^2 & & \\ & \ddots & \\ & & \|b_{:d_2}\|^2 \end{bmatrix}.$$

Therefore, using the formula (4.1.4), we find that

$$v(B_\pm) = \max\Big\{ \Big\| \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} (b_{jk} E_{jk})(b_{jk} E_{jk})^* \Big\|, \ \Big\| \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} (b_{jk} E_{jk})^* (b_{jk} E_{jk}) \Big\| \Big\}
= \max\left\{ \max\nolimits_j \|b_{j:}\|^2, \ \max\nolimits_k \|b_{:k}\|^2 \right\}.$$

We see that $v(B_\pm)$ coincides with $v$, the leading term (4.3.2) in the established estimate (4.3.1)!
Now, Theorem 4.1.1 delivers the bound

$$\mathbb{E}\|B_\pm\| \le \sqrt{2 v(B_\pm) \log(d_1 + d_2)}. \tag{4.3.3}$$

Observe that the estimate (4.3.3) for the norm matches the correct bound (4.3.1) up to the logarithmic
factor. Yet again, we obtain a result that is respectably close to the optimal one, even
though it is not quite sharp.

The main advantage of using results like Theorem 4.1.1 to analyze this random matrix is 
that we can obtain a good result with a minimal amount of arithmetic. The analysis that leads 
to (4.3.1) involves a specialized combinatorial argument. 
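The variance calculation for randomly signed matrices reduces to row and column norms, which makes it straightforward to try out. The sketch below assumes NumPy; the fixed matrix `B` is a hypothetical example with uniform entries, and the trial count is arbitrary.

```python
import numpy as np

def signed_variance_statistic(B):
    """v(B_pm): the larger of the maximum squared row norm and squared column norm."""
    row_sq = np.sum(np.abs(B) ** 2, axis=1)
    col_sq = np.sum(np.abs(B) ** 2, axis=0)
    return max(row_sq.max(), col_sq.max())

def randomly_signed(B, rng):
    """Flip the sign of each entry of B independently with probability 1/2."""
    signs = rng.choice([-1.0, 1.0], size=B.shape)
    return signs * B

rng = np.random.default_rng(2)
B = rng.uniform(0.0, 1.0, size=(30, 50))      # hypothetical fixed matrix
v = signed_variance_statistic(B)
mean_norm = float(np.mean([np.linalg.norm(randomly_signed(B, rng), 2)
                           for _ in range(50)]))
bound = np.sqrt(2 * v * np.log(sum(B.shape)))  # right-hand side of (4.3.3)

print("sqrt(v)               :", np.sqrt(v))
print("Monte Carlo mean norm :", mean_norm)
print("bound (4.3.3)         :", bound)
```

Since the spectral norm of any matrix dominates the norm of each of its rows and columns, every sample satisfies $\|B_\pm\| \ge v^{1/2}$, so the simulated mean must land between the first and third printed values whenever the bound holds.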


4.4 Example: Gaussian Toeplitz Matrices 

Matrix concentration inequalities offer an effective tool for analyzing random matrices whose
dependency structures are more complicated than those of the classical ensembles. In this section,
we consider Gaussian Toeplitz matrices, which have applications in signal processing.

We construct an (unsymmetric) $d \times d$ Gaussian Toeplitz matrix $\Gamma_d$ by populating the first row
and first column of the matrix with independent standard normal variables; the entries along
each diagonal of the matrix take the same value:

$$\Gamma_d = \begin{bmatrix}
\gamma_0 & \gamma_1 & \dots & & \gamma_{d-1} \\
\gamma_{-1} & \gamma_0 & \gamma_1 & & \vdots \\
\vdots & \gamma_{-1} & \gamma_0 & \ddots & \\
 & & \ddots & \ddots & \gamma_1 \\
\gamma_{-(d-1)} & \dots & & \gamma_{-1} & \gamma_0
\end{bmatrix}$$

where $\{\gamma_k\}$ is an independent family of standard normal variables. As usual, we represent the
Gaussian Toeplitz matrix as a matrix Gaussian series:

$$\Gamma_d = \gamma_0 I + \sum_{k=1}^{d-1} \gamma_k C^k + \sum_{k=1}^{d-1} \gamma_{-k} (C^k)^*, \tag{4.4.1}$$

where $C \in \mathbb{M}_d$ denotes the shift-up operator acting on $d$-dimensional column vectors:

$$C = \begin{bmatrix}
0 & 1 & & \\
 & 0 & \ddots & \\
 & & \ddots & 1 \\
 & & & 0
\end{bmatrix}.$$

It follows that $C^k$ shifts a vector up by $k$ places, introducing zeros at the bottom, while $(C^k)^*$
shifts a vector down by $k$ places, introducing zeros at the top.

We can analyze this example quickly using Theorem 4.1.1. First, note that

$$(C^k)(C^k)^* = \sum_{j=1}^{d-k} E_{jj} \quad\text{and}\quad (C^k)^*(C^k) = \sum_{j=k+1}^{d} E_{jj}.$$

To obtain the matrix variance statistic (4.1.4), we calculate the sum of the squares of the coefficient
matrices that appear in (4.4.1). In this instance, the two terms in the variance are the same.
We find that

$$I^2 + \sum_{k=1}^{d-1} (C^k)(C^k)^* + \sum_{k=1}^{d-1} (C^k)^*(C^k) = I + \sum_{k=1}^{d-1} \Big[ \sum_{j=1}^{d-k} E_{jj} + \sum_{j=k+1}^{d} E_{jj} \Big]$$

$$= \sum_{j=1}^{d} \Big[ 1 + \sum_{k=1}^{d-j} 1 + \sum_{k=1}^{j-1} 1 \Big] E_{jj} = \sum_{j=1}^{d} \big( 1 + (d - j) + (j - 1) \big) E_{jj} = d\, I_d. \tag{4.4.2}$$

In the second line, we (carefully) switch the order of summation and rewrite the identity matrix
as a sum of diagonal standard basis matrices. We reach

$$v(\Gamma_d) = \|d\, I_d\| = d.$$

An application of Theorem 4.1.1 leads us to conclude that

$$\mathbb{E}\|\Gamma_d\| \le \sqrt{2d \log(2d)}. \tag{4.4.3}$$

It turns out that the inequality (4.4.3) is correct up to the precise value of the constant, which
does not seem to be known. Nevertheless, the limiting value is available for the top eigenvalue
of a (scaled) symmetric Toeplitz matrix whose first row contains independent standard normal
variables [SV13, Thm. 1]. From this result, we may conclude that

$$0.8288 \le \frac{\mathbb{E}\|\Gamma_d\|}{\sqrt{2d \log(2d)}} \le 1 \quad\text{as } d \to \infty.$$

Here, we take $\{\Gamma_d\}$ to be a sequence of unsymmetric Gaussian Toeplitz matrices, indexed by the
ambient dimension $d$. Our simple argument gives the right scaling for this problem, and our
estimate for the constant lies within 21% of the optimal value!
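The Toeplitz calculation can be checked mechanically. The following sketch, assuming NumPy, represents $C^k$ as the matrix with ones on the $k$th superdiagonal, accumulates both sums of squared coefficients from (4.4.1), and samples $\Gamma_d$ by summing the series. The helper names, the ordering of the Gaussian coefficients in the sample vector, and the simulation parameters are illustrative assumptions.

```python
import numpy as np

def shift_power(d, k):
    """C^k for the d x d shift-up operator: ones on the k-th superdiagonal."""
    return np.eye(d, k=k)

def toeplitz_variance_sums(d):
    """Both sums of squared coefficients for the series (4.4.1)."""
    row_sum = np.eye(d)                    # contribution of the identity coefficient
    col_sum = np.eye(d)
    for k in range(1, d):
        Ck = shift_power(d, k)
        for coeff in (Ck, Ck.T):           # coefficients of gamma_k and gamma_{-k}
            row_sum += coeff @ coeff.T
            col_sum += coeff.T @ coeff
    return row_sum, col_sum

def gaussian_toeplitz(d, rng):
    """Sample Gamma_d by summing the series (4.4.1) term by term."""
    gamma = rng.standard_normal(2 * d - 1)  # gamma_0, gamma_1.., then gamma_{-1}..
    T = gamma[0] * np.eye(d)
    for k in range(1, d):
        Ck = shift_power(d, k)
        T += gamma[k] * Ck + gamma[d - 1 + k] * Ck.T
    return T

rng = np.random.default_rng(3)
d = 32
mean_norm = float(np.mean([np.linalg.norm(gaussian_toeplitz(d, rng), 2)
                           for _ in range(30)]))
bound = np.sqrt(2 * d * np.log(2 * d))      # right-hand side of (4.4.3)
print("Monte Carlo mean norm:", mean_norm)
print("bound (4.4.3)        :", bound)
print("ratio                :", mean_norm / bound)
```

For moderate dimensions the printed ratio need not be close to the asymptotic range quoted above, since that statement concerns the limit $d \to \infty$.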

4.5 Application: Rounding for the MaxQP Relaxation 

Our final application involves a more substantial question from combinatorial optimization. 
One of the methods that has been proposed for solving a certain optimization problem leads to 
a matrix Rademacher series, and the analysis of this method requires the spectral norm bounds 
from Theorem 4.1.1. A detailed treatment would take us too far afield, so we just sketch the 
context and indicate how the random matrix arises. 

There are many types of optimization problems that are computationally difficult to solve 
exactly. One approach to solving these problems is to enlarge the constraint set in such a way that 
the problem becomes tractable, a process called “relaxation.” After solving the relaxed problem, 
we can use a randomized “rounding” procedure to map the solution back to the constraint set 
for the original problem. If we can perform the rounding step without changing the value of 
the objective function substantially, then the rounded solution is also a decent solution to the 
original optimization problem. 

One difficult class of optimization problems has a matrix decision variable, and it requires
us to maximize a quadratic form in the matrix variable subject to a set of convex quadratic
constraints and a spectral norm constraint [Nem07]. This problem is referred to as MaxQP. The
desired solution $B$ to this problem is a $d_1 \times d_2$ matrix. The solution needs to satisfy several
different requirements, but we focus on the condition that $\|B\| \le 1$.

There is a natural relaxation of the MaxQP problem. When we solve the relaxation, we obtain
a family $\{B_k : k = 1, 2, \dots, n\}$ of $d_1 \times d_2$ matrices that satisfy the constraints

$$\sum_{k=1}^{n} B_k B_k^* \preccurlyeq I_{d_1} \quad\text{and}\quad \sum_{k=1}^{n} B_k^* B_k \preccurlyeq I_{d_2}. \tag{4.5.1}$$

In fact, these two bounds are part of the specification of the relaxed problem. To round the family
of matrices back to a solution of the original problem, we form the random matrix

$$Z = \alpha \sum_{k=1}^{n} \varrho_k B_k,$$

where $\{\varrho_k\}$ is an independent family of Rademacher random variables. The scaling factor $\alpha > 0$
can be adjusted to guarantee that the norm constraint $\|Z\| \le 1$ holds with high probability.

What is the expected norm of $Z$? Theorem 4.1.1 yields

$$\mathbb{E}\|Z\| \le \sqrt{2 v(Z) \log(d_1 + d_2)}.$$

Here, the matrix variance statistic satisfies

$$v(Z) = \alpha^2 \max\Big\{ \Big\| \sum_{k=1}^{n} B_k B_k^* \Big\|, \ \Big\| \sum_{k=1}^{n} B_k^* B_k \Big\| \Big\} \le \alpha^2,$$

owing to the constraint (4.5.1) on the matrices $B_1, \dots, B_n$. It follows that the scaling parameter $\alpha$
should satisfy

$$\alpha^2 = \frac{1}{2 \log(d_1 + d_2)}$$

to ensure that $\mathbb{E}\|Z\| \le 1$. For this choice of $\alpha$, the rounded solution $Z$ obeys the spectral norm
constraint on average. By using the tail bound (4.1.6), we can even obtain high-probability
estimates for the norm of the rounded solution $Z$.

The important fact here is that the scaling parameter $\alpha$ is usually small as compared with
the other parameters of the problem ($d_1$, $d_2$, $n$, and so forth). Therefore, the scaling does not
have a massive effect on the value of the objective function. Ultimately, this approach leads to a
technique for solving the MaxQP problem that produces a feasible point whose objective value
is within a factor of $\sqrt{2 \log(d_1 + d_2)}$ of the maximum objective value possible.
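The rounding step itself is short enough to sketch in code. The example below assumes NumPy and does not solve any relaxation: the family `Bs` is a hypothetical stand-in, generated at random and rescaled so that the two constraints in (4.5.1) hold. Only the scaling $\alpha$ and the random signs follow the description above.

```python
import numpy as np

def rounding_scale(d1, d2):
    """alpha with alpha^2 = 1 / (2 log(d1 + d2)), so that (4.1.5) gives E||Z|| <= 1."""
    return 1.0 / np.sqrt(2.0 * np.log(d1 + d2))

def round_solution(Bs, rng):
    """Form Z = alpha * sum_k rho_k B_k with independent random signs rho_k."""
    d1, d2 = Bs[0].shape
    signs = rng.choice([-1.0, 1.0], size=len(Bs))
    return rounding_scale(d1, d2) * sum(s * B for s, B in zip(signs, Bs))

# Hypothetical stand-in for a relaxed solution: random matrices rescaled
# so that both sums in (4.5.1) have spectral norm at most one.
rng = np.random.default_rng(4)
d1, d2, n = 8, 12, 20
raw = [rng.standard_normal((d1, d2)) for _ in range(n)]
scale = max(np.linalg.norm(sum(B @ B.T for B in raw), 2),
            np.linalg.norm(sum(B.T @ B for B in raw), 2))
Bs = [B / np.sqrt(scale) for B in raw]

norms = [np.linalg.norm(round_solution(Bs, rng), 2) for _ in range(200)]
mean_norm = float(np.mean(norms))
print("alpha                   :", rounding_scale(d1, d2))
print("Monte Carlo mean of ||Z||:", mean_norm)
print("fraction with ||Z|| <= 1 :", float(np.mean(np.array(norms) <= 1.0)))
```

Because the sums are positive semidefinite, bounding their spectral norms by one is the same as the semidefinite constraints in (4.5.1).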


4.6 Analysis of Matrix Gaussian & Rademacher Series 

We began this chapter with a concentration inequality, Theorem 4.1.1, for the norm of a matrix 
Gaussian series, and we have explored a number of different applications of this result. This 
section contains a proof of this theorem. 


4.6.1 Random Series with Hermitian Coefficients 


As the development in Chapter 3 suggests, random Hermitian matrices provide the natural setting
for establishing matrix concentration inequalities. Therefore, we begin our treatment with
a detailed statement of the matrix concentration inequality for a Gaussian series with Hermitian
matrix coefficients.


Theorem 4.6.1 (Matrix Gaussian & Rademacher Series: The Hermitian Case). Consider a finite
sequence $\{A_k\}$ of fixed Hermitian matrices with dimension $d$, and let $\{\gamma_k\}$ be a finite sequence of
independent standard normal variables. Introduce the matrix Gaussian series

$$Y = \sum_k \gamma_k A_k.$$

Let $v(Y)$ be the matrix variance statistic of the sum:

$$v(Y) = \|\mathbb{E}\, Y^2\| = \Big\| \sum_k A_k^2 \Big\|. \tag{4.6.1}$$

Then

$$\mathbb{E}\, \lambda_{\max}(Y) \le \sqrt{2 v(Y) \log d}. \tag{4.6.2}$$

Furthermore, for all $t \ge 0$,

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le d \exp\left(\frac{-t^2}{2v(Y)}\right). \tag{4.6.3}$$

The same bounds hold when we replace $\{\gamma_k\}$ by a finite sequence of independent Rademacher
random variables.

The proof of this result occupies the rest of the section.

4.6.2 Discussion 

Before we proceed to the analysis, let us take a moment to compare Theorem 4.6.1 with the result 
for general matrix series, Theorem 4.1.1. 

First, we consider the matrix variance statistic $v(Y)$ defined in (4.6.1). Since $Y$ has zero mean,
this definition coincides with the general formula (2.2.4). The second expression, in terms of
the coefficient matrices, follows from the additivity property (2.2.6) for the variance of a sum of
independent, random Hermitian matrices.

Next, bounds for the minimum eigenvalue $\lambda_{\min}(Y)$ follow from the results for the maximum
eigenvalue because $-Y$ has the same distribution as $Y$. Therefore,

$$\mathbb{E}\, \lambda_{\min}(Y) = \mathbb{E}\, \lambda_{\min}(-Y) = -\mathbb{E}\, \lambda_{\max}(Y) \ge -\sqrt{2 v(Y) \log d}. \tag{4.6.4}$$

The second identity holds because of the relationship (2.1.5) between minimum and maximum
eigenvalues. Similar considerations lead to a lower tail bound for the minimum eigenvalue:

$$\mathbb{P}\{\lambda_{\min}(Y) \le -t\} \le d \exp\left(\frac{-t^2}{2v(Y)}\right) \quad\text{for } t \ge 0. \tag{4.6.5}$$

This result follows directly from the upper tail bound (4.6.3).

This observation points to the most important difference between the Hermitian case and
the general case. Indeed, Theorem 4.6.1 concerns the extreme eigenvalues of the random series
$Y$ instead of the norm. This change amounts to producing one-sided tail bounds instead of
two-sided tail bounds. For Gaussian and Rademacher series, this improvement is not really useful,
but there are random Hermitian matrices whose minimum and maximum eigenvalues exhibit
different types of behavior. For these problems, it can be extremely valuable to examine the two
tails separately. See Chapters 5 and 6 for some results of this type.

4.6.3 Analysis for Hermitian Gaussian Series 

We continue with the proof that matrix Gaussian series exhibit the behavior described in
Theorem 4.6.1. Afterward, we show how to adapt the argument to address matrix Rademacher series.
Our main tool is Theorem 3.6.1, the set of master bounds for independent sums. To use this
result, we must identify the cgf of a fixed matrix modulated by a Gaussian random variable.

Lemma 4.6.2 (Gaussian × Matrix: Mgf and Cgf). Suppose that $A$ is a fixed Hermitian matrix, and
let $\gamma$ be a standard normal random variable. Then

$$\mathbb{E}\, e^{\gamma\theta A} = e^{\theta^2 A^2/2} \quad\text{and}\quad \log \mathbb{E}\, e^{\gamma\theta A} = \frac{\theta^2}{2} A^2 \quad\text{for } \theta \in \mathbb{R}.$$

Proof. We may assume $\theta = 1$ by absorbing $\theta$ into the matrix $A$. It is well known that the moments
of a standard normal variable satisfy

$$\mathbb{E}(\gamma^{2q+1}) = 0 \quad\text{and}\quad \mathbb{E}(\gamma^{2q}) = \frac{(2q)!}{2^q\, q!} \quad\text{for } q = 0, 1, 2, \dots.$$

The formula for the odd moments holds because a standard normal variable is symmetric. One
way to establish the formula for the even moments is to use integration by parts to obtain a
recursion for the $(2q)$th moment in terms of the $(2q - 2)$th moment.

Therefore, the matrix mgf satisfies

$$\mathbb{E}\, e^{\gamma A} = I + \sum_{q=1}^{\infty} \frac{\mathbb{E}(\gamma^{2q})}{(2q)!} A^{2q} = I + \sum_{q=1}^{\infty} \frac{1}{q!} (A^2/2)^q = e^{A^2/2}.$$

The first identity holds because the odd terms vanish from the series representation (2.1.15) of
the matrix exponential when we take the expectation. To compute the cgf, we extract the logarithm
of the mgf and recall (2.1.17), which states that the matrix logarithm is the functional
inverse of the matrix exponential. □
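The mgf identity in Lemma 4.6.2 can be checked numerically. The sketch below, assuming NumPy, approximates the Gaussian expectation with Gauss-Hermite quadrature (probabilists' weight $e^{-x^2/2}$) and evaluates matrix exponentials of Hermitian matrices through an eigendecomposition. The test matrix, the value of $\theta$, and the quadrature order are illustrative choices.

```python
import numpy as np

def expm_hermitian(H):
    """Matrix exponential of a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T

def gaussian_mgf(A, theta, order=60):
    """Approximate E exp(gamma * theta * A) by Gauss-Hermite quadrature in gamma."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(order)
    weights = weights / np.sqrt(2.0 * np.pi)   # normalize to the N(0,1) density
    return sum(w * expm_hermitian(x * theta * A) for x, w in zip(nodes, weights))

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                              # hypothetical Hermitian test matrix
theta = 0.7

lhs = gaussian_mgf(A, theta)                   # quadrature estimate of the mgf
rhs = expm_hermitian(theta**2 * (A @ A) / 2)   # closed form from Lemma 4.6.2
print("max entrywise discrepancy:", np.max(np.abs(lhs - rhs)))
```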


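We quickly reach results on the maximum eigenvalue of a matrix Gaussian series with Hermitian
coefficients.


Proof of Theorem 4.6.1: Gaussian Case. Consider a finite sequence $\{A_k\}$ of Hermitian matrices
with dimension $d$, and let $\{\gamma_k\}$ be a finite sequence of independent standard normal variables.
Define the matrix Gaussian series

$$Y = \sum_k \gamma_k A_k.$$

We begin with the upper bound (4.6.2) for $\mathbb{E}\, \lambda_{\max}(Y)$. The master expectation bound (3.6.1) from
Theorem 3.6.1 implies that

$$\mathbb{E}\, \lambda_{\max}(Y) \le \inf_{\theta > 0} \frac{1}{\theta} \log \operatorname{tr} \exp\Big( \sum_k \log \mathbb{E}\, e^{\gamma_k \theta A_k} \Big)$$

$$= \inf_{\theta > 0} \frac{1}{\theta} \log \operatorname{tr} \exp\Big( \frac{\theta^2}{2} \sum_k A_k^2 \Big)$$

$$\le \inf_{\theta > 0} \frac{1}{\theta} \log \Big[ d\, \lambda_{\max}\Big( \exp\Big( \frac{\theta^2}{2} \sum_k A_k^2 \Big) \Big) \Big]$$

$$= \inf_{\theta > 0} \frac{1}{\theta} \log \Big[ d \exp\Big( \frac{\theta^2}{2} \lambda_{\max}\Big( \sum_k A_k^2 \Big) \Big) \Big]$$

$$= \inf_{\theta > 0} \frac{1}{\theta} \Big[ \log d + \frac{\theta^2 v(Y)}{2} \Big].$$

The second line follows when we introduce the cgf from Lemma 4.6.2. To reach the third
inequality, we bound the trace by the dimension times the maximum eigenvalue. The fourth line
is the Spectral Mapping Theorem, Proposition 2.1.3. Use the formula (4.6.1) to identify the matrix
variance statistic $v(Y)$ in the exponent. The infimum is attained at $\theta = \sqrt{2 v(Y)^{-1} \log d}$. This
choice leads to (4.6.2).

Next, we turn to the proof of the upper tail bound (4.6.3) for $\lambda_{\max}(Y)$. Invoke the master tail
bound (3.6.3) from Theorem 3.6.1, and calculate that

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \inf_{\theta > 0} e^{-\theta t} \operatorname{tr} \exp\Big( \sum_k \log \mathbb{E}\, e^{\gamma_k \theta A_k} \Big)$$

$$= \inf_{\theta > 0} e^{-\theta t} \operatorname{tr} \exp\Big( \frac{\theta^2}{2} \sum_k A_k^2 \Big)$$

$$\le \inf_{\theta > 0} e^{-\theta t} \cdot d \exp\Big( \frac{\theta^2}{2} \lambda_{\max}\Big( \sum_k A_k^2 \Big) \Big)$$

$$= d \inf_{\theta > 0} e^{-\theta t + \theta^2 v(Y)/2}.$$

The steps here are the same as in the previous calculation. The infimum is achieved at $\theta = t/v(Y)$,
which yields (4.6.3). □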


4.6.4 Analysis for Hermitian Rademacher Series 

The inequalities for matrix Rademacher series involve arguments closely related to the proofs for 
matrix Gaussian series, but we require one additional piece of reasoning to obtain the simplest 
results. First, let us compute bounds for the matrix mgf and cgf of a Hermitian matrix modulated 
by a Rademacher random variable. 


Lemma 4.6.3 (Rademacher × Matrix: Mgf and Cgf). Suppose that $A$ is a fixed Hermitian matrix,
and let $\varrho$ be a Rademacher random variable. Then

$$\mathbb{E}\, e^{\varrho\theta A} \preccurlyeq e^{\theta^2 A^2/2} \quad\text{and}\quad \log \mathbb{E}\, e^{\varrho\theta A} \preccurlyeq \frac{\theta^2}{2} A^2 \quad\text{for } \theta \in \mathbb{R}.$$

Proof. First, we establish a scalar inequality. Comparing Taylor series,

$$\cosh(a) = \sum_{q=0}^{\infty} \frac{a^{2q}}{(2q)!} \le \sum_{q=0}^{\infty} \frac{a^{2q}}{2^q\, q!} = e^{a^2/2} \quad\text{for } a \in \mathbb{R}. \tag{4.6.6}$$

The inequality holds because $(2q)! \ge (2q)(2q - 2) \cdots (4)(2) = 2^q\, q!$.

To compute the matrix mgf, we may assume $\theta = 1$. By direct calculation,

$$\mathbb{E}\, e^{\varrho A} = \tfrac{1}{2} e^{A} + \tfrac{1}{2} e^{-A} = \cosh(A) \preccurlyeq e^{A^2/2}.$$

The semidefinite bound follows when we apply the Transfer Rule (2.1.14) to the inequality (4.6.6).
To determine the matrix cgf, observe that

$$\log \mathbb{E}\, e^{\varrho A} = \log \cosh(A) \preccurlyeq \tfrac{1}{2} A^2.$$

The semidefinite bound follows when we apply the Transfer Rule (2.1.14) to the scalar inequality
$\log \cosh(a) \le a^2/2$ for $a \in \mathbb{R}$, which is a consequence of (4.6.6). □
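Both semidefinite bounds in Lemma 4.6.3 can be illustrated on a concrete matrix. The sketch below, assuming NumPy, applies scalar functions to a Hermitian matrix through its eigenvalues and then inspects the smallest eigenvalue of each difference, which should be nonnegative up to rounding. The test matrix is a hypothetical example.

```python
import numpy as np

def hermitian_function(H, f):
    """Apply a scalar function to a Hermitian matrix through its eigenvalues."""
    w, V = np.linalg.eigh(H)
    return (V * f(w)) @ V.conj().T

rng = np.random.default_rng(6)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                                   # hypothetical Hermitian test matrix

# E exp(rho A) = (exp(A) + exp(-A)) / 2 = cosh(A)
mgf = 0.5 * (hermitian_function(A, np.exp) + hermitian_function(-A, np.exp))
mgf_bound = hermitian_function(A @ A / 2, np.exp)   # exp(A^2 / 2)
gap_mgf = np.linalg.eigvalsh(mgf_bound - mgf).min()

cgf = hermitian_function(A, lambda a: np.log(np.cosh(a)))   # log cosh(A)
gap_cgf = np.linalg.eigvalsh(A @ A / 2 - cgf).min()

print("smallest eigenvalue of exp(A^2/2) - cosh(A):", gap_mgf)
print("smallest eigenvalue of A^2/2 - log cosh(A) :", gap_cgf)
```

In this commuting situation the check reduces to the scalar inequality (4.6.6) applied to each eigenvalue of $A$, which is exactly what the Transfer Rule formalizes.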


We are prepared to develop some probability inequalities for the maximum eigenvalue of a 
Rademacher series with Hermitian coefficients. 


Proof of Theorem 4.6.1: Rademacher Case. Consider a finite sequence $\{A_k\}$ of Hermitian matrices,
and let $\{\varrho_k\}$ be a finite sequence of independent Rademacher variables. Define the matrix
Rademacher series

$$Y = \sum_k \varrho_k A_k.$$

The bounds for the extreme eigenvalues of $Y$ follow from an argument almost identical with the
proof in the Gaussian case. The only point that requires justification is the inequality

$$\operatorname{tr} \exp\Big( \sum_k \log \mathbb{E}\, e^{\varrho_k \theta A_k} \Big) \le \operatorname{tr} \exp\Big( \frac{\theta^2}{2} \sum_k A_k^2 \Big).$$

To obtain this result, we introduce the semidefinite bound, Lemma 4.6.3, for the Rademacher
cgf into the trace exponential. The left-hand side increases after this substitution because of the
fact (2.1.16) that the trace exponential function is monotone with respect to the semidefinite
order. □

4.6.5 Analysis of Matrix Series with Rectangular Coefficients

Finally, we consider a series with non-Hermitian matrix coefficients modulated by independent
Gaussian or Rademacher random variables. The bounds for the norm of a rectangular series
follow instantly from the bounds for the norm of a Hermitian series because of a formal device.
We simply apply the Hermitian results to the Hermitian dilation (2.1.26) of the series.


Proof of Theorem 4.1.1. Consider a finite sequence $\{B_k\}$ of $d_1 \times d_2$ complex matrices, and let $\{\zeta_k\}$
be a finite sequence of independent random variables, either standard normal or Rademacher.
Recall from Definition 2.1.5 that the Hermitian dilation is the map

\[ \mathscr{H} : B \longmapsto \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix}. \]

This leads us to form the two series

\[ Z = \sum\nolimits_k \zeta_k B_k \quad\text{and}\quad Y = \mathscr{H}(Z) = \sum\nolimits_k \zeta_k\, \mathscr{H}(B_k). \]

The second expression for $Y$ holds because the Hermitian dilation is real-linear. Since we have
written $Y$ as a matrix series with Hermitian coefficients, we may analyze it using Theorem 4.6.1.
We just need to express the conclusions in terms of the random matrix $Z$.

First, we employ the fact (2.1.28) that the Hermitian dilation preserves spectral information:

\[ \|Z\| = \lambda_{\max}(\mathscr{H}(Z)) = \lambda_{\max}(Y). \]

Therefore, bounds on $\lambda_{\max}(Y)$ deliver bounds on $\|Z\|$. In view of the calculation (2.2.10) for the
variance statistic of a dilation, we have

\[ v(Y) = v(\mathscr{H}(Z)) = v(Z). \]

Recall that the matrix variance statistic $v(Z)$ defined in (4.1.3) coincides with the general
definition from (2.2.8). Now, invoke Theorem 4.6.1 to obtain Theorem 4.1.1. □
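The dilation device can be checked by hand on a small example. The sketch below is our own illustration, not part of the original argument: it uses a real matrix (so the adjoint is just the transpose), plain Python lists, and hypothetical helper names. It confirms that the dilation is symmetric and that its square is block diagonal with blocks $BB^*$ and $B^*B$, which is the reason the dilation carries the same spectral information as $B$.

```python
def hermitian_dilation(B):
    """Return the dilation [[0, B], [B^T, 0]] of a real d1 x d2 matrix B."""
    d1, d2 = len(B), len(B[0])
    size = d1 + d2
    H = [[0] * size for _ in range(size)]
    for i in range(d1):
        for j in range(d2):
            H[i][d1 + j] = B[i][j]   # upper-right block holds B
            H[d1 + j][i] = B[i][j]   # lower-left block holds B^T
    return H

def matmul(X, Y):
    """Matrix product for matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

B = [[1, 2, 0],
     [3, -1, 4]]                     # arbitrary 2 x 3 example
H = hermitian_dilation(B)            # 5 x 5 symmetric matrix
H_squared = matmul(H, H)
BBt = matmul(B, transpose(B))        # 2 x 2 block
BtB = matmul(transpose(B), B)        # 3 x 3 block
```

Since the square of the dilation is block diagonal with blocks $BB^*$ and $B^*B$, its largest eigenvalue is $\|B\|^2$, so the norm of the dilation equals the norm of $B$.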


4.7 Notes 

We give an overview of research related to matrix Gaussian series, along with references for the 
specific random matrices that we have analyzed. 

4.7.1 Matrix Gaussian and Rademacher Series 

The main results, Theorem 4.1.1 and Theorem 4.6.1, have an interesting history. In the precise
form presented here, these two statements first appeared in [Tro11c], but we can trace them back
more than two decades.

In his work [Oli10b, Thm. 1], Oliveira established the mgf bounds presented in Lemma 4.6.2
and Lemma 4.6.3. He also developed an ingenious improvement on the arguments of Ahlswede
& Winter [AW02, App.], and he obtained a bound similar to Theorem 4.6.1. The constants in
Oliveira's result are worse, but the dependence on the dimension is better because it depends
on the number of summands. We do not believe that the approach Ahlswede & Winter describe
in [AW02] can deliver any of these results.

Recently, there have been some minor improvements to the dimensional factor that appears
in Theorem 4.6.1. We discuss these results and give citations in Chapter 7.

4.7.2 The Noncommutative Khintchine Inequality

Our theory about matrix Rademacher and Gaussian series should be compared with a classic
result, called the noncommutative Khintchine inequality, that was originally due to
Lust-Piquard [LP86]; see also the follow-up work [LPP91]. In its simplest form, this inequality
concerns a matrix Rademacher series with Hermitian coefficients:

\[ Y = \sum\nolimits_k \varrho_k A_k. \]

The noncommutative Khintchine inequality states that

\[ \mathbb{E} \operatorname{tr}\big[ Y^{2q} \big] \le C_{2q} \cdot \operatorname{tr}\big[ (\mathbb{E}\, Y^2)^q \big] \quad\text{for } q = 1, 2, 3, \dots. \tag{4.7.1} \]

The minimum value of the constant $C_{2q} = (2q)!/(2^q\, q!)$ was obtained in the two papers [Buc01,
Buc05]. Traditional proofs of the noncommutative Khintchine inequality are quite involved, but
there is now an elementary argument available [MJC+14, Cor. 7.3].
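To get a feeling for the size of the optimal constant $C_{2q} = (2q)!/(2^q\, q!)$, it may help to tabulate the first few values. The short sketch below is only an illustration; the function name is our own. The values agree with the double factorials $(2q - 1)!!$, which are also the even moments of a standard normal variable.

```python
from math import factorial

def khintchine_constant(q):
    """Exact integer value of C_{2q} = (2q)! / (2^q q!)."""
    return factorial(2 * q) // (2 ** q * factorial(q))

# Constants for q = 1, 2, 3, 4
constants = [khintchine_constant(q) for q in (1, 2, 3, 4)]
```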

Theorem 4.6.1 is the exponential moment analog of the polynomial moment bound (4.7.1).
The polynomial moment inequality is somewhat stronger than the exponential moment
inequality. Nevertheless, the exponential results are often more useful in practice. For a more
thorough exploration of the relationships between Theorem 4.6.1 and noncommutative moment
inequalities, such as (4.7.1), see the discussion in [Tro11c, §4].


4.7.3 Application to Random Matrices 

It has also been known for a long time that results such as Theorem 4.6.1 and inequality (4.7.1) 
can be used to study random matrices. 

We believe that the geometric functional analysis literature contains the earliest applications
of matrix concentration results to analyze random matrices. In a well-known paper [Rud99],
Mark Rudelson, acting on a suggestion of Gilles Pisier, showed how to use the noncommutative
Khintchine inequality (4.7.1) to study covariance estimation. This work led to a significant
amount of activity in which researchers used variants of Rudelson's argument to prove other
types of results. See, for example, the paper [RV07]. This approach is powerful, but it tends to
require some effort to use.

In parallel, other researchers in noncommutative probability theory also came to recognize 
the power of noncommutative moment inequalities in random matrix theory. The paper [JX08] 
contains a specific example. Unfortunately, this literature is technically formidable, which makes 
it difficult for outsiders to appreciate its achievements. 

The work [AW02] of Ahlswede & Winter led to the first “packaged” matrix concentration
inequalities of the type that we describe in these lecture notes. For the first few years after this work,
most of the applications concerned quantum information theory and random graph theory. The
paper [Gro11] introduced the method of Ahlswede & Winter to researchers in mathematical
signal processing and statistics, and it served to popularize matrix concentration bounds.

At this point, the available matrix concentration inequalities were still significantly
suboptimal. The main advances, in [Oli10a, Tro11c], led to optimal matrix concentration results of the
kind that we present in these lecture notes. These results allow researchers to obtain reasonably
accurate analyses of a wide variety of random matrices with very little effort.

4.7.4 Wigner and Marčenko-Pastur

Wigner matrices first emerged in the literature on nuclear physics, where they were used to
model the Hamiltonians of reactions involving heavy atoms [Meh04, §1.1]. Wigner [Wig55] showed
that the limiting spectral distribution of a certain type of Wigner matrix follows the semicircle
law. See the book [Tao12, §2.4] of Tao for an overview and the book [BS10, Chap. 2] of Bai &
Silverstein for a complete treatment. The Bai-Yin law [BY93] states that, up to scaling, the
maximum eigenvalue of a Wigner matrix converges almost surely to two. See [Tao12, §2.3] or [BS10,
Chap. 5] for more information. The analysis of the Gaussian Wigner matrix that we present here,
using Theorem 4.6.1, is drawn from [Tro11c, §4].

The first rigorous work on a rectangular Gaussian matrix is due to Marčenko & Pastur [MP67],
who established that the limiting distribution of the squared singular values follows a
distribution that now bears their names. The Bai-Yin law [BY93] gives an almost-sure limit for the largest
singular value of a rectangular Gaussian matrix. The expectation bound (4.2.5) appears in a
survey article [DS02] by Davidson & Szarek. The latter result is ultimately derived from a comparison
theorem for Gaussian processes due to Fernique [Fer75] and amplified by Gordon [Gor85]. Our
approach, using Theorem 4.1.1, is based on [Tro11c, §4].

4.7.5 Randomly Signed Matrices 

Matrices with randomly signed entries have not received much attention in the literature. The
result (4.3.1) is due to Yoav Seginer [Seg00]. There is also a well-known paper [Lat05] by Rafał
Latała that provides a bound for the expected norm of a Gaussian matrix whose entries have
nonuniform variance. Riemer & Schütt [RS13] have extended the earlier results. The very
recent paper [BV14] of Afonso Bandeira and Ramon Van Handel contains an elegant new proof of
Seginer's result based on a general theorem for random matrices with independent entries. The
analysis here, using Theorem 4.1.1, is drawn from [Tro11c, §4].

4.7.6 Gaussian Toeplitz Matrices 

Research on random Toeplitz matrices is surprisingly recent, but there are now a number of
papers available. Bryc, Dembo, & Jiang obtained the limiting spectral distribution of a symmetric
Toeplitz matrix based on independent and identically distributed (iid) random variables [BDJ06].
Later, Mark Meckes established the first bound for the expected norm of a random Toeplitz
matrix based on iid random variables [Mec07]. More recently, Sen & Virág computed the limiting
value of the expected norm of a random, symmetric Toeplitz matrix whose entries have identical
second-order statistics [SV13]. See the latter paper for additional references. The analysis here,
based on Theorem 4.1.1, is new. Our lower bound for the value of $\mathbb{E}\, \|\Gamma_d\|$ follows from the
results of Sen & Virág. We are not aware of any analysis for a random Toeplitz matrix whose entries
have different variances, but this type of result would follow from a simple modification of the
argument in §4.4.

4.7.7 Relaxation and Rounding of MaxQP 

The idea of using semidefinite relaxation and rounding to solve the MaxQP problem is due to
Arkadi Nemirovski [Nem07]. He obtained nontrivial results on the performance of his method
using some matrix moment calculations, but he was unable to reach the sharpest possible bound.
Anthony So [So09] pointed out that matrix moment inequalities imply an optimal result; he also
showed that matrix concentration inequalities have applications to robust optimization. The
presentation here, using Theorem 4.1.1, is essentially equivalent with the approach in [So09],
but we have achieved slightly better bounds for the constants.



Chapter 5. A Sum of Random Positive-Semidefinite Matrices


This chapter presents matrix concentration inequalities that are analogous with the classical
Chernoff bounds. In the matrix setting, Chernoff-type inequalities allow us to control the
extreme eigenvalues of a sum of independent, random, positive-semidefinite matrices.

More formally, we consider a finite sequence $\{X_k\}$ of independent, random Hermitian
matrices that satisfy

\[ 0 \le \lambda_{\min}(X_k) \quad\text{and}\quad \lambda_{\max}(X_k) \le L \quad\text{for each index } k. \]

Introduce the sum $Y = \sum_k X_k$. Our goal is to study the expectation and tail behavior of $\lambda_{\max}(Y)$
and $\lambda_{\min}(Y)$. Bounds on the maximum eigenvalue $\lambda_{\max}(Y)$ give us information about the norm
of the matrix $Y$, a measure of how much the action of the matrix can dilate a vector. Bounds
for the minimum eigenvalue $\lambda_{\min}(Y)$ tell us when the matrix $Y$ is nonsingular; they also provide
evidence about the norm of the inverse $Y^{-1}$, when it exists.

The matrix Chernoff inequalities are quite powerful, and they have numerous applications. 
We demonstrate the relevance of this theory by considering two examples. First, we show how 
to study the norm of a random submatrix drawn from a fixed matrix, and we explain how to 
check when the random submatrix has full rank. Second, we develop an analysis to determine 
when a random graph is likely to be connected. These two problems are closely related to basic 
questions in statistics and in combinatorics. 

In contrast, the matrix Bernstein inequalities, appearing in Chapter 6, describe how much
a random matrix deviates from its mean value. As such, the matrix Bernstein bounds are more
suitable than the matrix Chernoff bounds for problems that concern matrix approximations.
Matrix Bernstein inequalities are also more appropriate when the variance $v(Y)$ is small in
comparison with the upper bound $L$ on the summands.

Overview 

Section 5.1 presents the main results on the expectations and the tails of the extreme eigenvalues
of a sum of independent, random, positive-semidefinite matrices. Section 5.2 explains how the
matrix Chernoff bounds provide spectral information about a random submatrix drawn from
a fixed matrix. In §5.3, we use the matrix Chernoff bounds to study when a random graph is
connected. Afterward, in §5.4 we explain how to prove the main results.


5.1 The Matrix Chernoff Inequalities 

In the scalar setting, the Chernoff inequalities describe the behavior of a sum of independent, 
nonnegative random variables that are subject to a uniform upper bound. These results are often 
applied to study the number Y of successes in a sequence of independent—but not necessarily 
identical—Bernoulli trials with small probabilities of success. In this case, the Chernoff bounds 
show that Y behaves like a Poisson random variable. The random variable Y concentrates near 
the expected number of successes. Its lower tail has Gaussian decay, while its upper tail drops 
off faster than that of an exponential random variable. See [BLM13, §2.2] for more background. 

In the matrix setting, we encounter similar phenomena when we consider a sum of
independent, random, positive-semidefinite matrices whose eigenvalues meet a uniform upper bound.
This behavior emerges from the next theorem, which closely parallels the scalar Chernoff
theorem.

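Theorem 5.1.1 (Matrix Chernoff). Consider a finite sequence $\{X_k\}$ of independent, random,
Hermitian matrices with common dimension $d$. Assume that

\[ 0 \le \lambda_{\min}(X_k) \quad\text{and}\quad \lambda_{\max}(X_k) \le L \quad\text{for each index } k. \]

Introduce the random matrix

\[ Y = \sum\nolimits_k X_k. \]

Define the minimum eigenvalue $\mu_{\min}$ and maximum eigenvalue $\mu_{\max}$ of the expectation $\mathbb{E}\, Y$:

\[ \mu_{\min} = \lambda_{\min}(\mathbb{E}\, Y) = \lambda_{\min}\Big( \sum\nolimits_k \mathbb{E}\, X_k \Big), \quad\text{and} \tag{5.1.1} \]
\[ \mu_{\max} = \lambda_{\max}(\mathbb{E}\, Y) = \lambda_{\max}\Big( \sum\nolimits_k \mathbb{E}\, X_k \Big). \tag{5.1.2} \]

Then, for $\theta > 0$,

\[ \mathbb{E}\, \lambda_{\min}(Y) \ge \frac{1 - \mathrm{e}^{-\theta}}{\theta}\, \mu_{\min} - \frac{1}{\theta}\, L \log d, \quad\text{and} \tag{5.1.3} \]
\[ \mathbb{E}\, \lambda_{\max}(Y) \le \frac{\mathrm{e}^{\theta} - 1}{\theta}\, \mu_{\max} + \frac{1}{\theta}\, L \log d. \tag{5.1.4} \]

Furthermore,

\[ \mathbb{P}\{ \lambda_{\min}(Y) \le (1 - \varepsilon)\, \mu_{\min} \} \le d \left[ \frac{\mathrm{e}^{-\varepsilon}}{(1 - \varepsilon)^{1 - \varepsilon}} \right]^{\mu_{\min}/L} \quad\text{for } \varepsilon \in [0, 1), \text{ and} \tag{5.1.5} \]
\[ \mathbb{P}\{ \lambda_{\max}(Y) \ge (1 + \varepsilon)\, \mu_{\max} \} \le d \left[ \frac{\mathrm{e}^{\varepsilon}}{(1 + \varepsilon)^{1 + \varepsilon}} \right]^{\mu_{\max}/L} \quad\text{for } \varepsilon \ge 0. \tag{5.1.6} \]

The proof of Theorem 5.1.1 appears below in §5.4.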


5.1.1 Discussion 


Let us consider some facets of Theorem 5.1.1.

Aspects of the Matrix Chernoff Inequality

In many situations, it is easier to work with streamlined versions of the expectation bounds:

\[ \mathbb{E}\, \lambda_{\min}(Y) \ge 0.63\, \mu_{\min} - L \log d, \quad\text{and} \tag{5.1.7} \]
\[ \mathbb{E}\, \lambda_{\max}(Y) \le 1.72\, \mu_{\max} + L \log d. \tag{5.1.8} \]

We obtain these results by selecting $\theta = 1$ in both (5.1.3) and (5.1.4) and evaluating the numerical
constants.
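The numerical constants 0.63 and 1.72 are easy to reproduce. The following sketch is an illustration with function names of our own choosing. It evaluates the coefficients $(1 - \mathrm{e}^{-\theta})/\theta$ and $(\mathrm{e}^{\theta} - 1)/\theta$ from (5.1.3) and (5.1.4), and it packages the two expectation bounds so that other values of $\theta$ can be tried.

```python
import math

def lower_coefficient(theta):
    """Coefficient (1 - e^{-theta}) / theta that multiplies mu_min in (5.1.3)."""
    return (1.0 - math.exp(-theta)) / theta

def upper_coefficient(theta):
    """Coefficient (e^{theta} - 1) / theta that multiplies mu_max in (5.1.4)."""
    return (math.exp(theta) - 1.0) / theta

def expectation_bounds(theta, mu_min, mu_max, L, d):
    """Right-hand sides of (5.1.3) and (5.1.4) for a given theta > 0."""
    fluctuation = L * math.log(d) / theta
    lower = lower_coefficient(theta) * mu_min - fluctuation
    upper = upper_coefficient(theta) * mu_max + fluctuation
    return lower, upper

# theta = 1 gives the streamlined constants (rounded down and up, respectively)
c_lower = lower_coefficient(1.0)   # approximately 0.632
c_upper = upper_coefficient(1.0)   # approximately 1.718
```

Rounding the lower coefficient down to 0.63 and the upper coefficient up to 1.72 keeps both inequalities valid.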

These simplifications also help to clarify the meaning of Theorem 5.1.1. On average, $\lambda_{\min}(Y)$
is not much smaller than $\lambda_{\min}(\mathbb{E}\, Y)$, minus a fluctuation term that reflects the maximum size $L$
of a summand and the ambient dimension $d$. Similarly, the average value of $\lambda_{\max}(Y)$ is close to
$\lambda_{\max}(\mathbb{E}\, Y)$, plus the same fluctuation term.

We can also weaken the tail bounds (5.1.5) and (5.1.6) to reach

\[ \mathbb{P}\{ \lambda_{\min}(Y) \le t \mu_{\min} \} \le d\, \mathrm{e}^{-(1 - t)^2 \mu_{\min} / 2L} \quad\text{for } t \in [0, 1), \text{ and} \]
\[ \mathbb{P}\{ \lambda_{\max}(Y) \ge t \mu_{\max} \} \le d \left( \frac{\mathrm{e}}{t} \right)^{t \mu_{\max}/L} \quad\text{for } t \ge \mathrm{e}. \]

The first bound shows that the lower tail of $\lambda_{\min}(Y)$ decays at a subgaussian rate with variance
$L/\mu_{\min}$. The second bound manifests that the upper tail of $\lambda_{\max}(Y)$ decays faster than that of an
exponential random variable with mean $L/\mu_{\max}$. This is the same type of prediction we receive
from the scalar Chernoff inequalities.
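The weakened tail bounds can be compared numerically with the bounds (5.1.5) and (5.1.6) through the substitutions $t = 1 - \varepsilon$ and $t = 1 + \varepsilon$. The sketch below is illustrative only; the function names and the sample parameters are our own assumptions. It encodes the four right-hand sides so that one can confirm, on a few sample points, that the simplified forms are never smaller than the originals.

```python
import math

def lower_tail(eps, mu_min, L, d):
    """Right-hand side of (5.1.5), for 0 <= eps < 1."""
    base = math.exp(-eps) / (1.0 - eps) ** (1.0 - eps)
    return d * base ** (mu_min / L)

def upper_tail(eps, mu_max, L, d):
    """Right-hand side of (5.1.6), for eps >= 0."""
    base = math.exp(eps) / (1.0 + eps) ** (1.0 + eps)
    return d * base ** (mu_max / L)

def weak_lower_tail(t, mu_min, L, d):
    """Weakened lower tail d * exp(-(1 - t)^2 mu_min / (2 L)), for 0 <= t < 1."""
    return d * math.exp(-((1.0 - t) ** 2) * mu_min / (2.0 * L))

def weak_upper_tail(t, mu_max, L, d):
    """Weakened upper tail d * (e / t)^(t mu_max / L), for t >= e."""
    return d * (math.e / t) ** (t * mu_max / L)

# Sample comparison with hypothetical parameters mu = 20, L = 1, d = 10
lower_pairs = [(lower_tail(1.0 - t, 20.0, 1.0, 10), weak_lower_tail(t, 20.0, 1.0, 10))
               for t in (0.1, 0.5, 0.9)]
upper_pairs = [(upper_tail(t - 1.0, 20.0, 1.0, 10), weak_upper_tail(t, 20.0, 1.0, 10))
               for t in (3.0, 5.0, 10.0)]
```

In each pair, the first entry (the original bound) should not exceed the second (the weakened bound).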

As with other matrix concentration results, the tail bounds (5.1.5) and (5.1.6) can
overestimate the actual tail probabilities for the extreme eigenvalues of $Y$, especially at large deviations
from the mean. The value of the matrix Chernoff theorem derives from the estimates (5.1.3)
and (5.1.4) for the expectation of the minimum and maximum eigenvalue of $Y$. Scalar
concentration inequalities may provide better estimates for tail probabilities.

Related Results 

We can moderate the dimensional factor $d$ in the bounds for $\lambda_{\max}(Y)$ from Theorem 5.1.1 when
the random matrix $Y$ has limited spectral content in most directions. We take up this analysis in
Chapter 7.

Next, let us present an important refinement [CGT12a, Thm. A.1] of the bound (5.1.8) that
can be very useful in practice:

\[ \mathbb{E}\, \lambda_{\max}(Y) \le 2 \mu_{\max} + 8 \mathrm{e}\, \big( \mathbb{E} \max\nolimits_k \lambda_{\max}(X_k) \big) \log d. \tag{5.1.9} \]

This estimate may be regarded as a matrix version of Rosenthal's inequality [Ros70]. Observe that
the uniform bound $L$ appearing in (5.1.8) always exceeds the large parenthesis on the right-hand
side of (5.1.9). Therefore, the estimate (5.1.9) is valuable when the summands are unbounded
and, especially, when they have heavy tails. See the notes at the end of the chapter for more
information.

5.1.2 Optimality of the Matrix Chernoff Bounds 

In this section, we explore how well bounds such as Theorem 5.1.1 and inequality (5.1.9)
describe the behavior of a random matrix $Y$ formed as a sum of independent, random
positive-semidefinite matrices.

The Upper Chernoff Bounds

We will demonstrate that both terms in the matrix Rosenthal inequality (5.1.9) are necessary.
More precisely,

\[ \begin{aligned} \mathrm{const} \cdot \big[ \mu_{\max} + \mathbb{E} \max\nolimits_k \lambda_{\max}(X_k) \big] &\le \mathbb{E}\, \lambda_{\max}(Y) \\ &\le \mathrm{Const} \cdot \big[ \mu_{\max} + \big( \mathbb{E} \max\nolimits_k \lambda_{\max}(X_k) \big) \log d \big]. \end{aligned} \tag{5.1.10} \]

Therefore, we have identified appropriate parameters for bounding $\mathbb{E}\, \lambda_{\max}(Y)$, although the
constants and the logarithm may not be sharp in every case.

The appearance of $\mu_{\max}$ on the left-hand side of (5.1.10) is a consequence of Jensen's
inequality. Indeed, the maximum eigenvalue is convex, so

\[ \mathbb{E}\, \lambda_{\max}(Y) \ge \lambda_{\max}(\mathbb{E}\, Y) = \mu_{\max}. \]

To justify the other term, apply the fact that the summands $X_k$ are positive semidefinite to
conclude that

\[ \mathbb{E}\, \lambda_{\max}(Y) = \mathbb{E}\, \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \ge \mathbb{E} \max\nolimits_k \lambda_{\max}(X_k). \]

We have used the fact that $\lambda_{\max}(A + H) \ge \lambda_{\max}(A)$ whenever $H$ is positive semidefinite. Average
the last two displays to develop the left-hand side of (5.1.10). The right-hand side of (5.1.10) is
obviously just (5.1.9).

A simple example suffices to show that the logarithm cannot always be removed from the
second term in (5.1.8) or from (5.1.9). For each natural number $n$, consider the $d \times d$ random
matrix

\[ Y_n = \sum_{i=1}^{n} \sum_{k=1}^{d} \delta_{ik}^{(n)} E_{kk}, \]

where $\{\delta_{ik}^{(n)}\}$ is an independent family of BERNOULLI($n^{-1}$) random variables and $E_{kk}$ is the $d \times d$
matrix with a one in the $(k, k)$ entry and zeros elsewhere. An easy application of (5.1.8) delivers

\[ \mathbb{E}\, \lambda_{\max}(Y_n) \le 1.72 + \log d. \]

Using the Poisson limit of a binomial random variable and the Skorokhod representation, we can
construct an independent family $\{Q_k\}$ of POISSON(1) random variables for which

\[ Y_n \to \sum_{k=1}^{d} Q_k E_{kk} \quad\text{almost surely as } n \to \infty. \]

It follows that

\[ \mathbb{E}\, \lambda_{\max}(Y_n) \to \mathbb{E} \max\nolimits_k Q_k \approx \mathrm{const} \cdot \frac{\log d}{\log\log d} \quad\text{as } n \to \infty. \]

Therefore, the logarithm on the second term in (5.1.8) cannot be reduced by a factor larger than
the iterated logarithm $\log\log d$. This modest loss comes from approximations we make when
developing the estimate for the mean. The tail bound (5.1.6) accurately predicts the order of
$\lambda_{\max}(Y_n)$ in this example.
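The diagonal example is simple enough to simulate directly. Because every summand is diagonal, the maximum eigenvalue of $Y_n$ is just the largest of $d$ independent binomial counts with $n$ trials and success probability $1/n$. The sketch below is a rough Monte Carlo illustration; the sizes, the seed, and the function name are arbitrary choices of ours. It compares the empirical mean with the bound $1.72 + \log d$ and with the scale $\log d / \log\log d$.

```python
import math
import random

def sample_lambda_max(n, d, rng):
    """Largest of d independent Binomial(n, 1/n) counts, one per diagonal
    entry; this equals lambda_max(Y_n) in the diagonal example."""
    largest = 0
    for _ in range(d):
        count = sum(1 for _ in range(n) if rng.random() < 1.0 / n)
        largest = max(largest, count)
    return largest

rng = random.Random(0)               # fixed seed so the run is repeatable
n, d, trials = 50, 100, 100          # illustrative sizes, chosen arbitrarily
mean_max = sum(sample_lambda_max(n, d, rng) for _ in range(trials)) / trials
chernoff_bound = 1.72 + math.log(d)                   # from (5.1.8), mu_max = L = 1
poisson_scale = math.log(d) / math.log(math.log(d))   # order of E max_k Q_k
```

The empirical mean should land between $\mu_{\max} = 1$ and the Chernoff bound, which is consistent with the claim that the logarithm is lost only up to an iterated-logarithm factor.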

The latter example depends on the commutativity of the summands and the infinite
divisibility of the Poisson distribution, so it may seem rather special. Nevertheless, the logarithm really
does belong in many (but not all!) examples that arise in practice. In particular, it is necessary in
the application to random submatrices in §5.2.

The Lower Chernoff Bounds

The upper expectation bound (5.1.4) is quite satisfactory, but the situation is murkier for the
lower expectation bound (5.1.3). The mean term appears naturally in the lower bound:

\[ \mathbb{E}\, \lambda_{\min}(Y) \le \lambda_{\min}(\mathbb{E}\, Y) = \mu_{\min}. \]

This estimate is a consequence of Jensen's inequality and the concavity of the minimum
eigenvalue. On the other hand, it is not clear what the correct form of the second term in (5.1.3) should
be for a general sum of random positive-semidefinite matrices.

Nevertheless, a simple example demonstrates that the lower Chernoff bound (5.1.3) is
numerically sharp in some situations. Let $X$ be a $d \times d$ random positive-semidefinite matrix that
satisfies

\[ X = d\, E_{ii} \quad\text{with probability } d^{-1} \text{ for each index } i = 1, \dots, d. \]

It is clear that $\mathbb{E}\, X = I_d$. Form the random matrix

\[ Y_n = \sum_{k=1}^{n} X_k \quad\text{where each } X_k \text{ is an independent copy of } X. \]

Here $\mu_{\min} = n$ and $L = d$, so the lower Chernoff bound (5.1.3) implies that

\[ \mathbb{E}\, \lambda_{\min}(Y_n) \ge \frac{1 - \mathrm{e}^{-\theta}}{\theta} \cdot n - \frac{1}{\theta} \cdot d \log d. \]

The parameter $\theta > 0$ is at our disposal. This analysis predicts that $\mathbb{E}\, \lambda_{\min}(Y_n) > 0$ precisely when
$n > d \log d$.

On the other hand, $\lambda_{\min}(Y_n) > 0$ if and only if each diagonal matrix $d\, E_{ii}$ appears at least once
among the summands $X_1, \dots, X_n$. To determine the probability that this event occurs, notice that
this question is an instance of the coupon collector problem [MR95, §3.6]. The probability of
collecting all $d$ coupons within $n$ draws undergoes a phase transition from about zero to about
one at $n = d \log d$. By refining this argument [Tro11d], we can verify that both lower Chernoff
bounds (5.1.3) and (5.1.5) provide a numerically sharp lower bound for the value of $n$ where the
phase transition occurs. In other words, the lower matrix Chernoff bounds are themselves sharp.
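The coupon-collector phase transition is easy to observe in simulation. In the sketch below, which is only an illustration with our own function names, seed, and sample sizes, the event that the minimum eigenvalue of $Y_n$ is positive is identified with the event that all $d$ coordinates are drawn at least once in $n$ independent uniform draws. We estimate its probability well below and well above $n = d \log d$.

```python
import math
import random

def lambda_min_positive(n, d, rng):
    """True when every one of the d coordinates is drawn at least once in
    n uniform draws, which is the event lambda_min(Y_n) > 0."""
    seen = {rng.randrange(d) for _ in range(n)}
    return len(seen) == d

def success_rate(n, d, trials, rng):
    """Fraction of trials in which all d coordinates were collected."""
    return sum(lambda_min_positive(n, d, rng) for _ in range(trials)) / trials

rng = random.Random(1)                 # fixed seed for repeatability
d = 20
threshold = d * math.log(d)            # about 59.9 draws
below = success_rate(int(0.5 * threshold), d, 2000, rng)   # n = 29
above = success_rate(int(3.0 * threshold), d, 2000, rng)   # n = 179
```

With these sizes, the estimated probability should be close to zero at half the threshold and close to one at three times the threshold.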

5.2 Example: A Random Submatrix of a Fixed Matrix 

The matrix Chernoff inequality can be used to bound the extreme singular values of a random 
submatrix drawn from a fixed matrix. Theorem 5.1.1 might not seem suitable for this purpose 
because it deals with eigenvalues, but we can connect the method with the problem via a simple 
transformation. The results in this section have found applications in randomized linear algebra, 
sparse approximation, machine learning, and other fields. See the notes at the end of the chapter 
for some additional discussion and references. 

5.2.1 A Random Column Submatrix 

Let $B$ be a fixed $d \times n$ matrix, and let $b_{:k}$ denote the $k$th column of this matrix. The matrix can be
expressed as the sum of its columns:

\[ B = \sum_{k=1}^{n} b_{:k}\, e_k^*. \]

The symbol $e_k$ refers to the standard basis (column) vector with a one in the $k$th component and
zeros elsewhere; the length of the vector $e_k$ is determined by context.

We consider a simple model for a random column submatrix. Let $\{\delta_k\}$ be an independent
sequence of BERNOULLI($p/n$) random variables. Define the random matrix

\[ Z = \sum_{k=1}^{n} \delta_k\, b_{:k}\, e_k^*. \]

That is, we include each column independently with probability $p/n$, which means that there
are typically about $p$ nonzero columns in the matrix. We do not remove the other columns; we
just zero them out.

In this section, we will obtain bounds on the expectation of the extreme singular values $\sigma_1(Z)$
and $\sigma_d(Z)$ of the $d \times n$ random matrix $Z$. More precisely,

\[ \begin{aligned} \mathbb{E}\, \sigma_1(Z)^2 &\le 1.72 \cdot \frac{p}{n} \cdot \sigma_1(B)^2 + (\log d) \cdot \max\nolimits_k \|b_{:k}\|^2, \quad\text{and} \\ \mathbb{E}\, \sigma_d(Z)^2 &\ge 0.63 \cdot \frac{p}{n} \cdot \sigma_d(B)^2 - (\log d) \cdot \max\nolimits_k \|b_{:k}\|^2. \end{aligned} \tag{5.2.1} \]

That is, the random submatrix $Z$ gets its “fair share” of the squared singular values of the original
matrix $B$. There is a fluctuation term that depends on the largest norm of a column of $B$ and the
logarithm of the number $d$ of rows in $B$. This result is very useful because a positive lower bound
on $\sigma_d(Z)$ ensures that the rows of the random submatrix $Z$ are linearly independent.


The Analysis 

To study the singular values of $Z$, it is convenient to define a $d \times d$ random, positive-semidefinite
matrix

\[ Y = ZZ^* = \sum_{j,k=1}^{n} \delta_j \delta_k\, (b_{:j} e_j^*)(e_k b_{:k}^*) = \sum_{k=1}^{n} \delta_k\, b_{:k} b_{:k}^*. \]

The cross terms vanish because $e_j^* e_k = 0$ when $j \ne k$. Note that $\delta_k^2 = \delta_k$ because $\delta_k$ only
takes the values zero and one. The eigenvalues of $Y$ determine the singular values of $Z$, and vice
versa. In particular,

\[ \lambda_{\max}(Y) = \lambda_{\max}(ZZ^*) = \sigma_1(Z)^2 \quad\text{and}\quad \lambda_{\min}(Y) = \lambda_{\min}(ZZ^*) = \sigma_d(Z)^2, \]

where we arrange the singular values of $Z$ in weakly decreasing order $\sigma_1(Z) \ge \dots \ge \sigma_d(Z)$.

The matrix Chernoff inequality provides bounds for the expectations of the eigenvalues of $Y$.
To apply the result, first calculate

\[ \mathbb{E}\, Y = \sum_{k=1}^{n} (\mathbb{E}\, \delta_k)\, b_{:k} b_{:k}^* = \frac{p}{n} \sum_{k=1}^{n} b_{:k} b_{:k}^* = \frac{p}{n} \cdot BB^*, \]

so that

\[ \mu_{\max} = \lambda_{\max}(\mathbb{E}\, Y) = \frac{p}{n}\, \sigma_1(B)^2 \quad\text{and}\quad \mu_{\min} = \lambda_{\min}(\mathbb{E}\, Y) = \frac{p}{n}\, \sigma_d(B)^2. \]

Define $L = \max_k \|b_{:k}\|^2$, and observe that $\|\delta_k\, b_{:k} b_{:k}^*\| \le L$ for each index $k$. The simplified matrix
Chernoff bounds (5.1.7) and (5.1.8) now deliver the result (5.2.1).
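The expectation calculation can be verified exactly on a tiny example by enumerating every pattern of selected columns. The sketch below is our own illustration: the matrix, the helper names, and the use of exact rational arithmetic are assumptions made for the example, and the matrix is real so that the adjoint is the transpose. It confirms that the expectation of $ZZ^*$ equals $(p/n) \cdot BB^*$ and evaluates the uniform bound $L$.

```python
from fractions import Fraction
from itertools import product

def outer_sum(B, keep):
    """Return Z Z^T, where Z keeps column k of B when keep[k] == 1
    and replaces it with zeros otherwise."""
    d, n = len(B), len(B[0])
    return [[sum(keep[k] * B[i][k] * B[j][k] for k in range(n))
             for j in range(d)] for i in range(d)]

def expected_gram(B, prob):
    """E[Z Z^T], computed exactly by enumerating all 2^n selection patterns."""
    d, n = len(B), len(B[0])
    total = [[Fraction(0)] * d for _ in range(d)]
    for keep in product((0, 1), repeat=n):
        weight = Fraction(1)
        for bit in keep:
            weight *= prob if bit else 1 - prob   # probability of this pattern
        Y = outer_sum(B, keep)
        for i in range(d):
            for j in range(d):
                total[i][j] += weight * Y[i][j]
    return total

B = [[1, 2, 0, -1],
     [3, 0, 1, 2]]                  # arbitrary real 2 x 4 example
p, n = 2, 4                         # keep about p of the n columns
prob = Fraction(p, n)
EY = expected_gram(B, prob)
target = [[prob * x for x in row] for row in outer_sum(B, (1,) * n)]   # (p/n) B B^T
L = max(sum(B[i][k] ** 2 for i in range(2)) for k in range(n))         # max squared column norm
```

Enumeration is only feasible for very small $n$, but it gives an exact check of the identity without any sampling error.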

5.2.2 A Random Row and Column Submatrix

Next, we consider a model for a random set of rows and columns drawn from a fixed $d \times n$ matrix
$B$. In this case, it is helpful to use matrix notation to represent the extraction of a submatrix.
Define independent random projectors

\[ P = \operatorname{diag}(\delta_1, \dots, \delta_d) \quad\text{and}\quad R = \operatorname{diag}(\xi_1, \dots, \xi_n), \]

where $\{\delta_j\}$ is an independent family of BERNOULLI($p/d$) random variables and $\{\xi_k\}$ is an
independent family of BERNOULLI($r/n$) random variables. Then

\[ Z = PBR \]

is a random submatrix of $B$ with about $p$ nonzero rows and $r$ nonzero columns.

In this section, we will show that

\[ \begin{aligned} \mathbb{E}\, \|Z\|^2 \le{} & 3 \cdot \frac{p}{d} \cdot \frac{r}{n} \cdot \|B\|^2 + 2 \cdot \frac{p \log d}{d} \cdot \max\nolimits_k \|b_{:k}\|^2 \\ & + 2 \cdot \frac{r \log n}{n} \cdot \max\nolimits_j \|b_{j:}\|^2 + (\log d)(\log n) \cdot \max\nolimits_{j,k} |b_{jk}|^2. \end{aligned} \tag{5.2.2} \]

The notations $b_{j:}$ and $b_{:k}$ refer to the $j$th row and $k$th column of the matrix $B$, while $b_{jk}$ is the
$(j, k)$ entry of the matrix. In other words, the random submatrix $Z$ gets its share of the total
squared norm of the matrix $B$. The fluctuation terms reflect the maximum row norm and the
maximum column norm of $B$, as well as the size of the largest entry. There is also a weak
dependence on the ambient dimensions $d$ and $n$.


The Analysis 


The argument has much in common with the calculations for a random column submatrix, but
we need to do some extra work to handle the interaction between the random row sampling and
the random column sampling.

To begin, we express the squared norm $\|Z\|^2$ in terms of the maximum eigenvalue of a
random positive-semidefinite matrix:

\[ \begin{aligned} \mathbb{E}\, \|Z\|^2 &= \mathbb{E}\, \lambda_{\max}\big( (PBR)(PBR)^* \big) \\ &= \mathbb{E}\, \lambda_{\max}\big( (PB) R (PB)^* \big) = \mathbb{E} \left[ \mathbb{E} \left[ \lambda_{\max}\Big( \sum_{k=1}^{n} \xi_k\, (PB)_{:k} (PB)_{:k}^* \Big) \,\Big|\, P \right] \right]. \end{aligned} \]

We have used the fact that $RR^* = R$, and the notation $(PB)_{:k}$ refers to the $k$th column of the
matrix $PB$. Observe that the random positive-semidefinite matrix on the right-hand side has
dimension $d$. Invoking the matrix Chernoff inequality (5.1.8), conditional on the choice of $P$, we
obtain

\[ \mathbb{E}\, \|Z\|^2 \le 1.72 \cdot \frac{r}{n} \cdot \mathbb{E}\, \lambda_{\max}\big( (PB)(PB)^* \big) + (\log d) \cdot \mathbb{E} \max\nolimits_k \|(PB)_{:k}\|^2. \tag{5.2.3} \]

The required calculation is analogous with the one in Section 5.2.1, so we omit the details. To
reach a deterministic bound, we still have two more expectations to control.

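Next, we examine the term in (5.2.3) that involves the maximum eigenvalue:

\[ \mathbb{E}\, \lambda_{\max}\big( (PB)(PB)^* \big) = \mathbb{E}\, \lambda_{\max}(B^* P B) = \mathbb{E}\, \lambda_{\max}\Big( \sum_{j=1}^{d} \delta_j\, b_{j:}^* b_{j:} \Big). \]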














The first identity holds because $\lambda_{\max}(CC^*) = \lambda_{\max}(C^*C)$ for any matrix $C$, and $PP = P$. Observe
that the random positive-semidefinite matrix on the right-hand side has dimension $n$, and apply
the matrix Chernoff inequality (5.1.8) again to reach

\[ \mathbb{E}\, \lambda_{\max}\big( (PB)(PB)^* \big) \le 1.72 \cdot \frac{p}{d} \cdot \lambda_{\max}(B^* B) + (\log n) \cdot \max\nolimits_j \|b_{j:}\|^2. \tag{5.2.4} \]

Recall that $\lambda_{\max}(B^* B) = \|B\|^2$ to simplify this expression slightly.

Last, we develop a bound on the maximum column norm in (5.2.3). This result also follows
from the matrix Chernoff inequality, but we need to do a little work to see why. There are more
direct proofs, but this approach is closer in spirit to the rest of our proof.

We are going to treat the maximum column norm as the maximum eigenvalue of a sum of
independent, random diagonal matrices. Observe that

\[ \|(PB)_{:k}\|^2 = \sum_{j=1}^{d} \delta_j\, |b_{jk}|^2 \quad\text{for each } k = 1, \dots, n. \]

Using this representation, we see that

\[ \begin{aligned} \max\nolimits_k \|(PB)_{:k}\|^2 &= \lambda_{\max}\Big( \operatorname{diag}\Big( \sum_{j=1}^{d} \delta_j |b_{j1}|^2, \dots, \sum_{j=1}^{d} \delta_j |b_{jn}|^2 \Big) \Big) \\ &= \lambda_{\max}\Big( \sum_{j=1}^{d} \delta_j \operatorname{diag}\big( |b_{j1}|^2, \dots, |b_{jn}|^2 \big) \Big). \end{aligned} \]

To activate the matrix Chernoff bound, we need to compute the two parameters that appear
in (5.1.8). First, the uniform upper bound $L$ satisfies

\[ L = \max\nolimits_j \lambda_{\max}\big( \operatorname{diag}( |b_{j1}|^2, \dots, |b_{jn}|^2 ) \big) = \max\nolimits_j \max\nolimits_k |b_{jk}|^2. \]

Second, to compute $\mu_{\max}$, note that

\[ \begin{aligned} \mathbb{E} \sum_{j=1}^{d} \delta_j \operatorname{diag}\big( |b_{j1}|^2, \dots, |b_{jn}|^2 \big) &= \frac{p}{d} \operatorname{diag}\Big( \sum_{j=1}^{d} |b_{j1}|^2, \dots, \sum_{j=1}^{d} |b_{jn}|^2 \Big) \\ &= \frac{p}{d} \operatorname{diag}\big( \|b_{:1}\|^2, \dots, \|b_{:n}\|^2 \big). \end{aligned} \]

Take the maximum eigenvalue of this expression to reach

\[ \mu_{\max} = \frac{p}{d} \cdot \max\nolimits_k \|b_{:k}\|^2. \]

Therefore, the matrix Chernoff inequality implies

\[ \mathbb{E} \max\nolimits_k \|(PB)_{:k}\|^2 \le 1.72 \cdot \frac{p}{d} \cdot \max\nolimits_k \|b_{:k}\|^2 + (\log n) \cdot \max\nolimits_{j,k} |b_{jk}|^2. \tag{5.2.5} \]

On average, the maximum squared column norm of a random submatrix $PB$ with approximately
$p$ nonzero rows gets its share $p/d$ of the maximum squared column norm of $B$, plus a fluctuation
term that depends on the magnitude of the largest entry of $B$ and the logarithm of the number $n$
of columns.

Combine the three bounds (5.2.3), (5.2.4), and (5.2.5) to reach the result (5.2.2). We have
simplified numerical constants to make the expression more compact.
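The final combination step is pure arithmetic, and it can be spelled out in a few lines. The sketch below is illustrative; the function names, the argument names, and the sample values are our own. It substitutes (5.2.4) and (5.2.5) into (5.2.3) with the constant 1.72 left intact, and it compares the outcome with the rounded form (5.2.2). The inputs are summary statistics of $B$ (squared spectral norm, largest squared column norm, largest squared row norm, largest squared entry), which are taken as given rather than computed.

```python
import math

C = 1.72   # numerical constant from the simplified bound (5.1.8)

def chained_bound(p, r, d, n, norm_sq, max_col_sq, max_row_sq, max_entry_sq):
    """Substitute (5.2.4) and (5.2.5) into (5.2.3) without rounding constants."""
    eig_term = C * (p / d) * norm_sq + math.log(n) * max_row_sq        # (5.2.4)
    col_term = C * (p / d) * max_col_sq + math.log(n) * max_entry_sq   # (5.2.5)
    return C * (r / n) * eig_term + math.log(d) * col_term             # (5.2.3)

def simplified_bound(p, r, d, n, norm_sq, max_col_sq, max_row_sq, max_entry_sq):
    """Right-hand side of (5.2.2), with constants rounded up to 3, 2, 2, 1."""
    return (3 * (p / d) * (r / n) * norm_sq
            + 2 * (p * math.log(d) / d) * max_col_sq
            + 2 * (r * math.log(n) / n) * max_row_sq
            + math.log(d) * math.log(n) * max_entry_sq)

# Hypothetical statistics for a 100 x 200 matrix, sampling p = 10 rows and r = 20 columns
args = (10, 20, 100, 200, 50.0, 4.0, 6.0, 1.0)
exact_chain = chained_bound(*args)
rounded = simplified_bound(*args)
```

The rounding only uses the facts $1.72^2 \le 3$ and $1.72 \le 2$, so the chained bound never exceeds the simplified one.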

5.3 Application: When is an Erdős-Rényi Graph Connected?

Random graph theory concerns probabilistic models for the interactions between pairs of
objects. One basic question about a random graph is to ask whether there is a path connecting
every pair of vertices or whether there are vertices segregated in different parts of the graph. It
is possible to address this problem by studying the eigenvalues of random matrices, a challenge
that we take up in this section.

5.3.1 Background on Graph Theory 

Recall that an undirected graph is a pair $G = (V, E)$. The elements of the set $V$ are called vertices.
The set $E$ is a collection of unordered pairs $\{u, v\}$ of distinct vertices, called edges. We say that the
graph has an edge between vertices $u$ and $v$ in $V$ if the pair $\{u, v\}$ appears in $E$. For simplicity, we
assume that the vertex set $V = \{1, \dots, n\}$. The degree $\deg(k)$ of the vertex $k$ is the number of edges
in $E$ that include the vertex $k$.

There are several natural matrices associated with an undirected graph. The adjacency
matrix of the graph $G$ is an $n \times n$ symmetric matrix $A$ whose entries indicate which edges are present:

\[ a_{jk} = \begin{cases} 1, & \{j, k\} \in E \\ 0, & \{j, k\} \notin E. \end{cases} \]

We have assumed that edges connect distinct vertices, so the diagonal entries of the matrix $A$
equal zero. Next, define a diagonal matrix $D = \operatorname{diag}(\deg(1), \dots, \deg(n))$ whose entries list the
degrees of the vertices. The Laplacian $\Delta$ and normalized Laplacian $H$ of the graph are the matrices

\[ \Delta = D - A \quad\text{and}\quad H = D^{-1/2} \Delta D^{-1/2}. \]

We place the convention that $D^{-1/2}(k, k) = 0$ when $\deg(k) = 0$. The Laplacian matrix $\Delta$ is always
positive semidefinite. The vector $e \in \mathbb{R}^n$ of ones is always an eigenvector of $\Delta$ with eigenvalue
zero.

These matrices and their spectral properties play a dominant role in modern graph theory.
For example, the graph $G$ is connected if and only if the second-smallest eigenvalue of $\Delta$ is
strictly positive. The second-smallest eigenvalue of $H$ controls the rate at which a random walk
on the graph $G$ converges to the stationary distribution (under appropriate assumptions). See
the book [GR01] for more information about these connections.

5.3.2 The Model of Erdős & Rényi

The simplest possible example of a random graph is the independent model $G(n, p)$ of Erdős and
Rényi [ER60]. The number $n$ is the number of vertices in the graph, and $p \in (0, 1)$ is the
probability that two vertices are connected. More precisely, here is how to construct a random graph in
$G(n, p)$. Between each pair of distinct vertices, we place an edge independently at random with
probability $p$. In other words, the adjacency matrix takes the form

\[ a_{jk} = \begin{cases} \xi_{jk}, & 1 \le j < k \le n \\ \xi_{kj}, & 1 \le k < j \le n \\ 0, & j = k. \end{cases} \tag{5.3.1} \]

[Figure 5.1 image: sparsity pattern of the adjacency matrix of an Erdős-Rényi graph in G(100, 0.1); both axes run from 0 to 100; nz = 972.]

Figure 5.1: The adjacency matrix of an Erdős-Rényi graph. This figure shows the pattern of
nonzero entries in the adjacency matrix $A$ of a random graph drawn from $G(100, 0.1)$. Out of
a possible 4,950 edges, there are 486 edges present. A basic question is whether the graph is
connected. The graph is disconnected if and only if there is a permutation of the vertices so
that the adjacency matrix is block diagonal. This property is reflected in the second-smallest
eigenvalue of the Laplacian matrix $\Delta$.


The family $\{\xi_{jk} : 1 \le j < k \le n\}$ consists of mutually independent BERNOULLI($p$) random
variables. Figure 5.1 shows one realization of the adjacency matrix of an Erdős-Rényi graph.

Let us explain how to represent the adjacency matrix and Laplacian matrix of an Erdős-Rényi
graph as a sum of independent random matrices. The adjacency matrix $A$ of a random graph in
$G(n, p)$ can be written as

\[ A = \sum_{1 \le j < k \le n} \xi_{jk}\, (E_{jk} + E_{kj}). \tag{5.3.2} \]

This expression is a straightforward translation of the definition (5.3.1) into matrix form.
Similarly, the Laplacian matrix $\Delta$ of the random graph can be expressed as

\[ \Delta = \sum_{1 \le j < k \le n} \xi_{jk}\, (E_{jj} + E_{kk} - E_{jk} - E_{kj}). \tag{5.3.3} \]

To verify the formula (5.3.3), observe that the presence of an edge between the vertices $j$ and $k$
increases the degree of $j$ and $k$ by one. Therefore, when $\xi_{jk} = 1$, we augment the $(j, j)$ and $(k, k)$
entries of $\Delta$ to reflect the change in degree, and we mark the $(j, k)$ and $(k, j)$ entries with $-1$ to
reflect the presence of the edge between $j$ and $k$.
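The edge-by-edge description of the Laplacian translates directly into code. The sketch below is an illustration with our own function names, and it labels vertices from 0 rather than 1. It assembles the Laplacian by adding one term of the form $E_{jj} + E_{kk} - E_{jk} - E_{kj}$ per edge, and separately forms the degree matrix minus the adjacency matrix, so that the two constructions can be compared.

```python
def laplacian_from_edges(n, edges):
    """Assemble the Laplacian as a sum of E_jj + E_kk - E_jk - E_kj over the edges.
    Vertices are labeled 0, ..., n-1 here (the text uses 1, ..., n)."""
    lap = [[0] * n for _ in range(n)]
    for j, k in edges:
        lap[j][j] += 1      # the edge raises the degree of j
        lap[k][k] += 1      # ... and the degree of k
        lap[j][k] -= 1      # mark the edge in both off-diagonal positions
        lap[k][j] -= 1
    return lap

def degree_minus_adjacency(n, edges):
    """Form D - A directly from the degrees and the adjacency matrix."""
    adj = [[0] * n for _ in range(n)]
    for j, k in edges:
        adj[j][k] = adj[k][j] = 1
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]

edges = [(0, 1), (1, 2), (2, 0), (3, 4)]     # a triangle plus a separate edge
lap = laplacian_from_edges(5, edges)
row_sums = [sum(row) for row in lap]         # all zero: the ones vector is in the null space
```

Every row of the Laplacian sums to zero, which is the statement that the vector of ones is an eigenvector with eigenvalue zero.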

5.3.3 Connectivity of an Erdős-Rényi Graph

We will obtain a near-optimal bound for the range of parameters where an Erdős-Rényi graph
$G(n, p)$ is likely to be connected. We can accomplish this goal by showing that the
second-smallest eigenvalue of the $n \times n$ random Laplacian matrix $\Delta = D - A$ is strictly positive. We will solve
the problem by using the matrix Chernoff inequality to study the second-smallest eigenvalue of
the random Laplacian $\Delta$.

We need to form a random matrix $Y$ that consists of independent positive-semidefinite terms
and whose minimum eigenvalue coincides with the second-smallest eigenvalue of $\Delta$. Our
approach is to compress the Laplacian $\Delta$ to the orthogonal complement of the vector $e$ of
ones. To that end, we introduce an $(n-1) \times n$ partial isometry $R$ that satisfies

$$
R R^* = I_{n-1} \quad\text{and}\quad R e = 0. \tag{5.3.4}
$$


Now, consider the $(n-1) \times (n-1)$ random matrix

$$
Y = R \Delta R^* = \sum_{1 \le j < k \le n} \xi_{jk} \cdot R \, (E_{jj} + E_{kk} - E_{jk} - E_{kj}) \, R^*. \tag{5.3.5}
$$

Recall that $\{\xi_{jk}\}$ is an independent family of Bernoulli($p$) random variables, so the
summands are mutually independent. The Conjugation Rule (2.1.12) ensures that each summand
remains positive semidefinite. Furthermore, the Courant-Fischer theorem implies that the minimum
eigenvalue of $Y$ coincides with the second-smallest eigenvalue of $\Delta$ because the smallest
eigenvalue of $\Delta$ has eigenvector $e$.

To apply the matrix Chernoff inequality, we show that $L = 2$ is an upper bound for the
eigenvalues of each summand in (5.3.5). We have

$$
\| \xi_{jk} \cdot R \, (E_{jj} + E_{kk} - E_{jk} - E_{kj}) \, R^* \|
  \le |\xi_{jk}| \cdot \|R\| \cdot \| E_{jj} + E_{kk} - E_{jk} - E_{kj} \| \cdot \|R^*\| \le 2.
$$

The first bound follows from the submultiplicativity of the spectral norm. To obtain the second
bound, note that $\xi_{jk}$ takes 0-1 values. The matrix $R$ is a partial isometry, so its norm
equals one. Finally, a direct calculation shows that $T = E_{jj} + E_{kk} - E_{jk} - E_{kj}$
satisfies the polynomial identity $T^2 = 2T$, so each eigenvalue of $T$ must equal zero or two.

Next, we compute the expectation of the matrix Y. 


$$
\mathbb{E}\, Y = p \cdot R \Big[ \sum_{1 \le j < k \le n} (E_{jj} + E_{kk} - E_{jk} - E_{kj}) \Big] R^*
  = p \cdot R \big[ (n-1) I_n - (e e^* - I_n) \big] R^* = pn \cdot I_{n-1}.
$$


The first identity follows when we apply linearity of expectation to (5.3.5) and then use linearity
of matrix multiplication to draw the sum inside the conjugation by $R$. The term $(n-1) I_n$
emerges when we sum the diagonal matrices. The term $e e^* - I_n$ comes from the off-diagonal
matrix units, once we note that the matrix $e e^*$ has one in each component. The last identity
holds because of the properties of $R$ displayed in (5.3.4). We conclude that

$$
\lambda_{\min}(\mathbb{E}\, Y) = pn.
$$


This is all the information we need. 

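To arrive at a probability inequality for the second-smallest eigenvalue $\lambda_2^{\uparrow}(\Delta)$
of the matrix $\Delta$, we apply the tail bound (5.1.5) to the matrix $Y$. We obtain, for $t \in (0, 1)$,

$$
\mathbb{P}\{ \lambda_2^{\uparrow}(\Delta) \le t \cdot pn \}
  = \mathbb{P}\{ \lambda_{\min}(Y) \le t \cdot pn \}
  \le (n-1) \cdot \Big[ \frac{e^{t-1}}{t^{t}} \Big]^{pn/2}.
$$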
To appreciate what this means, we may think about the situation where $t \to 0$. Then the bracket
tends to $e^{-1}$, and we see that the second-smallest eigenvalue of $\Delta$ is unlikely to be zero
when $\log(n-1) - pn/2 < 0$. Rearranging this expression, we obtain a sufficient condition

$$
p > \frac{2 \log(n-1)}{n}
$$

for an Erdős–Rényi graph $G(n, p)$ to be connected with high probability as $n \to \infty$. This
bound is quite close to the optimal result, which lacks the factor two on the right-hand side. It is
possible to make this reasoning more precise, but it does not seem worth the fuss.
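To get a feel for the numbers, the Chernoff-type bound on the second-smallest Laplacian eigenvalue and the resulting threshold for $p$ can be evaluated directly. This is a small sketch under the assumptions of this section ($L = 2$, minimum eigenvalue of the mean equal to $pn$, dimension $n - 1$); the function names are hypothetical.

```python
import math

def chernoff_connectivity_bound(n, p, t):
    """Evaluate (n - 1) * (e^(t-1) / t^t)^(p*n/2) for 0 < t < 1.

    This is the upper bound on P{ lambda_2(Delta) <= t * p * n } for G(n, p).
    """
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    log_bracket = (t - 1.0) - t * math.log(t)  # log of e^(t-1) / t^t
    return (n - 1) * math.exp(0.5 * p * n * log_bracket)

def sufficient_edge_probability(n):
    """Threshold 2*log(n-1)/n obtained by letting t tend to zero."""
    return 2.0 * math.log(n - 1) / n

n = 100
p_star = sufficient_edge_probability(n)              # roughly 0.092 for n = 100
dense = chernoff_connectivity_bound(n, 0.5, 0.01)    # informative: far below one
sparse = chernoff_connectivity_bound(n, 0.05, 0.01)  # vacuous: exceeds one
```

For $n = 100$ the threshold is about $0.092$, so the graph in Figure 5.1 with $p = 0.1$ sits just above it. The bound only becomes small once $p$ clears the threshold by a comfortable margin.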


5.4 Proof of the Matrix Chernoff Inequalities 

The first step toward the matrix Chernoff inequalities is to develop an appropriate semidefinite 
bound for the mgf and cgf of a random positive-semidefinite matrix. The method for establishing 
this result mimics the proof in the scalar case: we simply bound the exponential with a linear 
function. 


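Lemma 5.4.1 (Matrix Chernoff: Mgf and Cgf Bound). Suppose that $X$ is a random matrix that
satisfies $0 \le \lambda_{\min}(X)$ and $\lambda_{\max}(X) \le L$. Then

$$
\mathbb{E}\, e^{\theta X} \preccurlyeq \exp\Big( \frac{e^{\theta L} - 1}{L} \cdot \mathbb{E}\, X \Big)
\quad\text{and}\quad
\log \mathbb{E}\, e^{\theta X} \preccurlyeq \frac{e^{\theta L} - 1}{L} \cdot \mathbb{E}\, X
\quad\text{for } \theta \in \mathbb{R}.
$$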


Proof. Consider the function $f(x) = e^{\theta x}$. Since $f$ is convex, its graph lies below the
chord connecting any two points on the graph. In particular,

$$
f(x) \le f(0) + \frac{f(L) - f(0)}{L} \cdot x \quad\text{for } x \in [0, L].
$$

In detail,

$$
e^{\theta x} \le 1 + \frac{e^{\theta L} - 1}{L} \cdot x \quad\text{for } x \in [0, L].
$$


By assumption, each eigenvalue of $X$ lies in the interval $[0, L]$. Thus, the Transfer Rule (2.1.14)
implies that

$$
e^{\theta X} \preccurlyeq I + \frac{e^{\theta L} - 1}{L} \cdot X.
$$


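Expectation respects the semidefinite order, so

$$
\mathbb{E}\, e^{\theta X} \preccurlyeq I + \frac{e^{\theta L} - 1}{L} \cdot \mathbb{E}\, X
  \preccurlyeq \exp\Big( \frac{e^{\theta L} - 1}{L} \cdot \mathbb{E}\, X \Big).
$$

The second relation is a consequence of the fact that $I + A \preccurlyeq e^{A}$ for every Hermitian
matrix $A$, which we obtain by applying the Transfer Rule (2.1.14) to the inequality
$1 + a \le e^{a}$, valid for all $a \in \mathbb{R}$.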

To obtain the semidefinite bound for the cgf, we simply take the logarithm of the semidefinite
bound for the mgf. This operation preserves the semidefinite order because of the
property (2.1.18) that the logarithm is operator monotone. □
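The proof of Lemma 5.4.1 rests on two scalar inequalities that the Transfer Rule lifts to matrices: the chord bound for $e^{\theta x}$ on $[0, L]$ and the bound $1 + a \le e^{a}$. The sketch below checks both on a grid for a few values of $\theta$ of either sign; the function name and the grid are illustrative assumptions.

```python
import math

def chord_and_exp_bounds(theta, L, x):
    """Return (e^(theta*x), chord bound, exponential bound) for scalar x in [0, L]."""
    slope = (math.exp(theta * L) - 1.0) / L  # (e^(theta L) - 1) / L
    exact = math.exp(theta * x)
    chord = 1.0 + slope * x                  # linear upper bound from convexity
    expo = math.exp(slope * x)               # follows from 1 + a <= e^a
    return exact, chord, expo

L = 2.0
tol = 1e-12  # the bounds are tight at the endpoints, so allow rounding slack
grid = [i * L / 50 for i in range(51)]
violations = []
for theta in (-1.5, -0.3, 0.4, 1.2):
    for x in grid:
        exact, chord, expo = chord_and_exp_bounds(theta, L, x)
        if not (exact <= chord + tol and chord <= expo + tol):
            violations.append((theta, x))
```

The chord bound holds with equality at $x = 0$ and $x = L$, which is why the check needs a small tolerance there.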
We break the proof of the matrix inequality into two pieces. First, we establish the bounds
on the maximum eigenvalue, which are slightly easier. Afterward, we develop the bounds on the 
minimum eigenvalue. 

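Proof of Theorem 5.1.1, Maximum Eigenvalue Bounds. Consider a finite sequence $\{X_k\}$ of
independent, random Hermitian matrices with common dimension $d$. Assume that

$$
0 \le \lambda_{\min}(X_k) \quad\text{and}\quad \lambda_{\max}(X_k) \le L \quad\text{for each index } k.
$$

The cgf bound, Lemma 5.4.1, states that

$$
\log \mathbb{E}\, e^{\theta X_k} \preccurlyeq g(\theta) \cdot \mathbb{E}\, X_k
\quad\text{where}\quad g(\theta) = \frac{e^{\theta L} - 1}{L} \quad\text{for } \theta > 0. \tag{5.4.1}
$$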

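We begin with the upper bound (5.1.4) for $\mathbb{E}\, \lambda_{\max}(Y)$. Using the fact (2.1.16)
that the trace of the exponential function is monotone with respect to the semidefinite order, we
substitute these cgf bounds into the master inequality (3.6.1) for the expectation of the maximum
eigenvalue to reach

$$
\begin{aligned}
\mathbb{E}\, \lambda_{\max}(Y)
  &\le \inf_{\theta > 0} \frac{1}{\theta} \log \operatorname{tr} \exp\Big( g(\theta) \sum_k \mathbb{E}\, X_k \Big) \\
  &\le \inf_{\theta > 0} \frac{1}{\theta} \log \big[ d \, \lambda_{\max}\big( \exp( g(\theta) \cdot \mathbb{E}\, Y ) \big) \big] \\
  &= \inf_{\theta > 0} \frac{1}{\theta} \log \big[ d \exp\big( \lambda_{\max}( g(\theta) \cdot \mathbb{E}\, Y ) \big) \big] \\
  &= \inf_{\theta > 0} \frac{1}{\theta} \log \big[ d \exp\big( g(\theta) \cdot \lambda_{\max}( \mathbb{E}\, Y ) \big) \big] \\
  &= \inf_{\theta > 0} \frac{1}{\theta} \big[ \log d + g(\theta) \cdot \mu_{\max} \big].
\end{aligned}
$$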

In the second line, we use the fact that the matrix exponential is positive definite to bound the
trace by $d$ times the maximum eigenvalue; we have also identified the sum as $\mathbb{E}\, Y$. The
third line follows from the Spectral Mapping Theorem, Proposition 2.1.3. Next, we use the
fact (2.1.4) that the maximum eigenvalue is a positive-homogeneous map, which depends on the
observation that $g(\theta) > 0$ for $\theta > 0$. Finally, we identify the statistic $\mu_{\max}$
defined in (5.1.2). The infimum does not admit a closed form, but we can obtain the
expression (5.1.4) by making the change of variables $\theta \mapsto \theta / L$.

Next, we turn to the upper bound (5.1.6) for the upper tail of the maximum eigenvalue.
Substitute the cgf bounds (5.4.1) into the master inequality (3.6.3) to reach

$$
\begin{aligned}
\mathbb{P}\{ \lambda_{\max}(Y) \ge t \}
  &\le \inf_{\theta > 0} e^{-\theta t} \operatorname{tr} \exp\Big( g(\theta) \sum_k \mathbb{E}\, X_k \Big) \\
  &\le \inf_{\theta > 0} e^{-\theta t} \cdot d \exp\big( g(\theta) \cdot \mu_{\max} \big).
\end{aligned}
$$

The steps here are identical with the previous argument. To complete the proof, make the change
of variables $t \mapsto (1 + \varepsilon) \mu_{\max}$. Then the infimum is achieved at
$\theta = L^{-1} \log(1 + \varepsilon)$, which leads to the tail bound (5.1.6). □
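For readers who want to see the final substitution spelled out, here is a short worked computation. It is a sketch that uses only the quantities already defined in the proof, namely $g(\theta) = (e^{\theta L} - 1)/L$, $t = (1 + \varepsilon)\mu_{\max}$, and $\theta = L^{-1} \log(1 + \varepsilon)$.

```latex
\begin{aligned}
% Stationarity of the exponent -\theta t + g(\theta)\,\mu_{\max} in \theta:
-t + e^{\theta L} \mu_{\max} = 0
  \quad &\Longleftrightarrow \quad
  e^{\theta L} = \frac{t}{\mu_{\max}} = 1 + \varepsilon, \\
% Values of the two factors at the minimizer:
g(\theta) = \frac{\varepsilon}{L},
  \qquad
  e^{-\theta t} &= (1 + \varepsilon)^{-(1 + \varepsilon)\mu_{\max}/L}, \\
% Combine the factors:
e^{-\theta t} \cdot d \exp\big( g(\theta)\,\mu_{\max} \big)
  &= d \cdot \left[ \frac{e^{\varepsilon}}{(1 + \varepsilon)^{1 + \varepsilon}} \right]^{\mu_{\max}/L}.
\end{aligned}
```

Since the exponent is convex in $\theta$, the stationary point is the global minimizer, and it is admissible because $\varepsilon \ge 0$ gives $\theta \ge 0$.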

The lower bounds follow from a related argument that is slightly more delicate. 
Proof of Theorem 5.1.1, Minimum Eigenvalue Bounds. Once again, consider a finite sequence $\{X_k\}$
of independent, random Hermitian matrices with dimension $d$. Assume that

$$
0 \le \lambda_{\min}(X_k) \quad\text{and}\quad \lambda_{\max}(X_k) \le L \quad\text{for each index } k.
$$

The cgf bound, Lemma 5.4.1, states that

$$
\log \mathbb{E}\, e^{\theta X_k} \preccurlyeq g(\theta) \cdot \mathbb{E}\, X_k
\quad\text{where}\quad g(\theta) = \frac{e^{\theta L} - 1}{L} \quad\text{for } \theta < 0. \tag{5.4.2}
$$

Note that $g(\theta) < 0$ for $\theta < 0$, which alters a number of the steps in the argument.

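We commence with the lower bound (5.1.3) for $\mathbb{E}\, \lambda_{\min}(Y)$. As stated in (2.1.16),
the trace exponential function is monotone with respect to the semidefinite order, so the master
inequality (3.6.2) for the minimum eigenvalue delivers

$$
\begin{aligned}
\mathbb{E}\, \lambda_{\min}(Y)
  &\ge \sup_{\theta < 0} \frac{1}{\theta} \log \operatorname{tr} \exp\Big( g(\theta) \sum_k \mathbb{E}\, X_k \Big) \\
  &\ge \sup_{\theta < 0} \frac{1}{\theta} \log \big[ d \, \lambda_{\max}\big( \exp( g(\theta) \cdot \mathbb{E}\, Y ) \big) \big] \\
  &= \sup_{\theta < 0} \frac{1}{\theta} \log \big[ d \exp\big( \lambda_{\max}( g(\theta) \cdot \mathbb{E}\, Y ) \big) \big] \\
  &= \sup_{\theta < 0} \frac{1}{\theta} \log \big[ d \exp\big( g(\theta) \cdot \lambda_{\min}( \mathbb{E}\, Y ) \big) \big] \\
  &= \sup_{\theta < 0} \frac{1}{\theta} \big[ \log d + g(\theta) \cdot \mu_{\min} \big].
\end{aligned}
$$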

Most of the steps are the same as in the proof of the upper bound (5.1.4), so we focus on the
differences. Since the factor $\theta^{-1}$ in the first and second lines is negative, upper bounds on
the trace reduce the value of the expression. We move to the fourth line by invoking the property
$\lambda_{\max}(\alpha A) = \alpha \lambda_{\min}(A)$ for $\alpha < 0$, which follows from (2.1.4)
and (2.1.5). This piece of algebra depends on the fact that $g(\theta) < 0$ when $\theta < 0$. To
obtain the result (5.1.3), we change variables: $\theta \mapsto -\theta / L$.

Finally, we establish the bound (5.1.5) for the lower tail of the minimum eigenvalue.
Introduce the cgf bounds (5.4.2) into the master inequality (3.6.4) to reach

$$
\begin{aligned}
\mathbb{P}\{ \lambda_{\min}(Y) \le t \}
  &\le \inf_{\theta < 0} e^{-\theta t} \operatorname{tr} \exp\Big( g(\theta) \sum_k \mathbb{E}\, X_k \Big) \\
  &\le \inf_{\theta < 0} e^{-\theta t} \cdot d \exp\big( g(\theta) \cdot \mu_{\min} \big).
\end{aligned}
$$

The justifications here match those in the previous argument. Finally, we make the change
of variables $t \mapsto (1 - \varepsilon) \mu_{\min}$. The infimum is attained at
$\theta = L^{-1} \log(1 - \varepsilon)$, which yields the tail bound (5.1.5). □
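The closed-form minimizers quoted in both tail arguments can be sanity-checked numerically. The sketch below works with the logarithm of the bound, omitting the common factor $d$; the function names and the sample values of $\mu$, $L$, and $\varepsilon$ are illustrative assumptions. It compares the value at the stated $\theta$ with a brute-force search over a grid.

```python
import math

def log_tail_objective(theta, t, mu, L):
    """log of e^(-theta*t) * exp(g(theta)*mu), where g(theta) = (e^(theta*L) - 1) / L."""
    return -theta * t + (math.exp(theta * L) - 1.0) / L * mu

def closed_form_log_bound(eps, mu, L, upper=True):
    """log of [e^s / (1+s)^(1+s)]^(mu/L) with s = +eps (upper tail) or s = -eps (lower tail)."""
    s = eps if upper else -eps
    return (mu / L) * (s - (1.0 + s) * math.log(1.0 + s))

mu, L, eps = 5.0, 2.0, 0.3

# Upper tail: t = (1 + eps) * mu, minimizer theta = log(1 + eps) / L > 0.
theta_up = math.log(1.0 + eps) / L
value_up = log_tail_objective(theta_up, (1 + eps) * mu, mu, L)
grid_up = min(log_tail_objective(0.001 * i, (1 + eps) * mu, mu, L) for i in range(1, 3000))

# Lower tail: t = (1 - eps) * mu, minimizer theta = log(1 - eps) / L < 0.
theta_lo = math.log(1.0 - eps) / L
value_lo = log_tail_objective(theta_lo, (1 - eps) * mu, mu, L)
grid_lo = min(log_tail_objective(-0.001 * i, (1 - eps) * mu, mu, L) for i in range(1, 3000))
```

Because the objective is convex in $\theta$, no grid point can beat the closed-form minimizer, and the value there agrees with the bracketed expressions in (5.1.5) and (5.1.6).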


5.5 Notes 

As usual, we continue with an overview of background references and related work. 
5.5.1 Matrix Chernoff Inequalities

Scalar Chernoff inequalities date to the paper [Che52, Thm. 1] by Herman Chernoff. The original 
result provides probability bounds for the number of successes in a sequence of independent but 
non-identical Bernoulli trials. Chernoff’s proof combines the scalar Laplace transform method 
with refined bounds on the mgf of a Bernoulli random variable. It is very common to encounter 
simplified versions of Chernoff’s result, such as [Lug09, Exer. 8] or [MR95, §4.1].

In their paper [AW02], Ahlswede & Winter developed a matrix version of the Chernoff
inequality. The matrix mgf bound, Lemma 5.4.1, essentially appears in their work. Ahlswede &
Winter focus on the case of independent and identically distributed random matrices, in which
case their results are roughly equivalent with Theorem 5.1.1. For the general case, their approach
leads to matrix expectation statistics of the form

$$
\mu_{\min}^{\mathrm{AW}} = \sum_k \lambda_{\min}(\mathbb{E}\, X_k)
\quad\text{and}\quad
\mu_{\max}^{\mathrm{AW}} = \sum_k \lambda_{\max}(\mathbb{E}\, X_k).
$$

It is clear that their $\mu_{\min}^{\mathrm{AW}}$ may be substantially smaller than the quantity
$\mu_{\min}$ we defined in Theorem 5.1.1. Similarly, their $\mu_{\max}^{\mathrm{AW}}$ may be
substantially larger than the quantity $\mu_{\max}$ that drives the upper Chernoff bounds.

The tail bounds from Theorem 5.1.1 are drawn from [Tro11c, §5], but the expectation bounds
we present are new. The technical report [GT14] extends the matrix Chernoff inequality to
provide upper and lower tail bounds for all eigenvalues of a sum of random, positive-semidefinite
matrices. Chapter 7 contains a slight improvement of the bounds for the maximum eigenvalue
in Theorem 5.1.1.

Let us mention a few other results that are related to the matrix Chernoff inequality. First, 
Theorem 5.1.1 has a lovely information-theoretic formulation where the tail bounds are stated 
in terms of an information divergence. To establish this result, we must restructure the proof and 
eliminate some of the approximations. See [AW02, Thm. 19] or [Tro11c, Thm. 5.1].

Second, the problem of bounding the minimum eigenvalue of a sum of random,
positive-semidefinite matrices has a special character. The reason, roughly, is that a sum of
independent, nonnegative random variables cannot easily take the value zero. A closely related
phenomenon holds in the matrix setting, and it is possible to develop estimates that exploit this
observation. See [Oli13, Thm. 3.1] and [KM13, Thm. 1.3] for two wildly different approaches.

5.5.2 The Matrix Rosenthal Inequality 

The matrix Rosenthal inequality (5.1.9) is one of the earliest matrix concentration bounds. In
his paper [Rud99], Rudelson used the noncommutative Khintchine inequality (4.7.1) to establish
a specialization of (5.1.9) to rank-one summands. A refinement appears in [RV07], and explicit
constants were first derived in [Tro08c]. We believe that the paper [CGT12a] contains the
first complete statement of the moment bound (5.1.9) for general positive-semidefinite
summands; see also the work [MZ11]. The constants in [CGT12a, Thm. A.1], and hence in (5.1.9), can
be improved slightly by using the sharp version of the noncommutative Khintchine inequality
from [Buc01, Buc05]. Let us stress that all of these results follow from easy variations of
Rudelson’s argument.

The work [MJC+14, Cor. 7.4] provides a self-contained and completely elementary proof of
a matrix Rosenthal inequality that is closely related to (5.1.9). This result depends on different 
principles from the works mentioned in the last paragraph. 
5.5.3 Random Submatrices

The problem of studying a random submatrix drawn from a fixed matrix has a long history.
An early example is the paving problem from operator theory, which asks for a maximal
well-conditioned set of columns (or a well-conditioned submatrix) inside a fixed matrix. Random
selection provides a natural way to approach this question. The papers of Bourgain & Tzafriri
[BT87, BT91] and Kashin & Tzafriri [KT94] study random paving using sophisticated tools from
functional analysis. See the paper [NT14] for a summary of research on randomized methods for
constructing pavings. Very recently, Adam Marcus, Dan Spielman, & Nikhil Srivastava [MSS14]
have solved the paving problem completely.

Later, Rudelson and Vershynin [RV07] showed that the noncommutative Khintchine inequality
provides a clean way to bound the norm of a random column submatrix (or a random row
and column submatrix) drawn from a fixed matrix. Their ideas have found many applications
in the mathematical signal processing literature. For example, the paper [Tro08a] uses similar
techniques to analyze the performance of $\ell_1$ minimization for recovering a random sparse signal.
The same methods support the paper [Tro08c], which contains a modern proof of the random
paving result [BT91, Thm. 2.1] of Bourgain & Tzafriri.

The article [Tro11d] contains the observation that the matrix Chernoff inequality is an ideal
tool for studying random submatrices. It applies this technique to study a random matrix that
arises in numerical linear algebra [HMT11], and it achieves an optimal estimate for the minimum
singular value of the random matrix that arises in this setting. Our analysis of a random column
submatrix is based on this work. The analysis of a random row and column submatrix is new.
The paper [CD12], by Chrétien and Darses, uses matrix Chernoff bounds in a more sophisticated
way to develop tail bounds for the norm of a random row and column submatrix.

5.5.4 Random Graphs 

The analysis of random graphs and random hypergraphs appeared as one of the earliest
applications of matrix concentration inequalities [AW02]. Christofides and Markström developed a
matrix Hoeffding inequality to aid in this purpose [CM08]. Later, Oliveira wrote two papers
[Oli10a, Oli11] on random graph theory based on matrix concentration. We recommend these works
for further information.

To analyze the random graph Laplacian, we compressed the Laplacian to a subspace so that 
the minimum eigenvalue of the compression coincides with the second-smallest eigenvalue of 
the original Laplacian. This device can be extended to obtain tail bounds for all the eigenvalues 
of a sum of independent random matrices. See the technical report [GT14] for a development of 
this idea. 



A Sum of Bounded Random Matrices


In this chapter, we describe matrix concentration inequalities that generalize the classical
Bernstein bound. The matrix Bernstein inequalities concern a random matrix formed as a sum of
independent, random matrices that are bounded in spectral norm. The results allow us to study
how much this type of random matrix deviates from its mean value in the spectral norm.

Formally, we consider a finite sequence $\{S_k\}$ of random matrices of the same dimension.
Assume that the matrices satisfy the conditions

$$
\mathbb{E}\, S_k = 0 \quad\text{and}\quad \|S_k\| \le L \quad\text{for each index } k.
$$

Form the sum $Z = \sum_k S_k$. The matrix Bernstein inequality controls the expectation and tail
behavior of $\|Z\|$ in terms of the matrix variance statistic $v(Z)$ and the uniform bound $L$.

The matrix Bernstein inequality is a powerful tool with a huge number of applications. In
these pages, we can only give a coarse indication of how researchers have used this result, so we
have chosen to focus on problems that use random sampling to approximate a specified matrix.
This model applies to the sample covariance matrix in the introduction. In this chapter, we
outline several additional examples. First, we consider the technique of randomized sparsification,
in which we replace a dense matrix with a sparse proxy that has similar spectral behavior.
Second, we explain how to develop a randomized algorithm for approximate matrix multiplication,
and we establish an error bound for this method. Third, we develop an analysis of random
features, a method for approximating kernel matrices that has become popular in contemporary
machine learning.

As these examples suggest, the matrix Bernstein inequality is very effective for studying
randomized approximations of a given matrix. Nevertheless, when the matrix Chernoff inequality,
Theorem 5.1.1, happens to apply to a problem, it often delivers better results.
Overview


Section 6.1 describes the matrix Bernstein inequality. Section 6.2 explains how to use the
Bernstein inequality to study randomized methods for matrix approximation. In §§6.3, 6.4, and 6.5,
we apply the latter result to three matrix approximation problems. We conclude with the proof
of the matrix Bernstein inequality in §6.6.

6.1 A Sum of Bounded Random Matrices 

In the scalar setting, the label “Bernstein inequality” applies to a very large number of
concentration results. Most of these bounds have extensions to matrices. For simplicity, we focus on the
most famous of the scalar results, a tail bound for the sum Z of independent, zero-mean random 
variables that are subject to a uniform bound. In this case, the Bernstein inequality shows that Z 
concentrates around zero. The tails of Z make a transition from subgaussian decay at moderate 
deviations to subexponential decay at large deviations. See [BLM13, §2.7] for more information 
about Bernstein’s inequality. 

In analogy, the simplest matrix Bernstein inequality concerns a sum of independent,
zero-mean random matrices whose norms are bounded above. The theorem demonstrates that the
norm of the sum acts much like the scalar random variable $Z$ that we discussed in the last
paragraph.

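Theorem 6.1.1 (Matrix Bernstein). Consider a finite sequence $\{S_k\}$ of independent, random
matrices with common dimension $d_1 \times d_2$. Assume that

$$
\mathbb{E}\, S_k = 0 \quad\text{and}\quad \|S_k\| \le L \quad\text{for each index } k.
$$

Introduce the random matrix

$$
Z = \sum_k S_k.
$$

Let $v(Z)$ be the matrix variance statistic of the sum:

$$
\begin{aligned}
v(Z) &= \max\big\{ \|\mathbb{E}(Z Z^*)\|, \ \|\mathbb{E}(Z^* Z)\| \big\} && (6.1.1) \\
     &= \max\Big\{ \Big\| \sum_k \mathbb{E}(S_k S_k^*) \Big\|, \ \Big\| \sum_k \mathbb{E}(S_k^* S_k) \Big\| \Big\}. && (6.1.2)
\end{aligned}
$$

Then

$$
\mathbb{E}\, \|Z\| \le \sqrt{2 v(Z) \log(d_1 + d_2)} + \frac{1}{3} L \log(d_1 + d_2). \tag{6.1.3}
$$

Furthermore, for all $t \ge 0$,

$$
\mathbb{P}\{ \|Z\| \ge t \} \le (d_1 + d_2) \exp\Big( \frac{-t^2/2}{v(Z) + L t / 3} \Big). \tag{6.1.4}
$$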


The proof of Theorem 6.1.1 appears in §6.6. 


6.1.1 Discussion 

Let us spend a few moments to discuss the matrix Bernstein inequality, Theorem 6.1.1, its
consequences, and some of the improvements that are available.
Aspects of the Matrix Bernstein Inequality

First, observe that the matrix variance statistic $v(Z)$ appearing in (6.1.1) coincides with the
general definition (2.2.8) because $Z$ has zero mean. To reach (6.1.2), we have used the additivity
law (2.2.11) for an independent sum to express the matrix variance statistic in terms of the
summands. Observe that, when the summands $S_k$ are Hermitian, the two terms in the maximum
coincide.

The expectation bound (6.1.3) shows that $\mathbb{E}\, \|Z\|$ is on the same scale as the root
$\sqrt{v(Z)}$ of the matrix variance statistic and the upper bound $L$ for the summands; there is
also a weak dependence on the ambient dimension $d$. In general, all three of these features are
necessary. Nevertheless, the bound may not be very tight for particular examples. See
Section 6.1.2 for some evidence.

Next, let us explain how to interpret the tail bound (6.1.4). The main difference between this
result and the scalar Bernstein bound is the appearance of the dimensional factor $d_1 + d_2$, which
reduces the range of $t$ where the inequality is informative. To get a better idea of what this result
means, it is helpful to make a further estimate:

$$
\mathbb{P}\{ \|Z\| \ge t \} \le
\begin{cases}
(d_1 + d_2) \cdot e^{-3 t^2 / (8 v(Z))}, & t \le v(Z)/L, \\
(d_1 + d_2) \cdot e^{-3 t / (8 L)}, & t \ge v(Z)/L.
\end{cases} \tag{6.1.5}
$$

In other words, for moderate values of $t$, the tail probability decays as fast as the tail of a
Gaussian random variable whose variance is comparable with $v(Z)$. For larger values of $t$, the tail
probability decays at least as fast as that of an exponential random variable whose mean is
comparable with $L$. As usual, we insert a warning that the tail behavior reported by the matrix
Bernstein inequality can overestimate the actual tail behavior.
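The split into a subgaussian and a subexponential regime follows from the tail bound (6.1.4) by bounding the denominator of the exponent separately in each range of $t$. The following short derivation is a sketch that uses only the symbols of Theorem 6.1.1.

```latex
\begin{aligned}
t \le v(Z)/L: \quad
  & v(Z) + \tfrac{1}{3} L t \le \tfrac{4}{3}\, v(Z)
  \quad\Longrightarrow\quad
  \frac{t^2/2}{v(Z) + L t/3} \ge \frac{3 t^2}{8\, v(Z)}, \\[4pt]
t \ge v(Z)/L: \quad
  & v(Z) + \tfrac{1}{3} L t \le \tfrac{4}{3}\, L t
  \quad\Longrightarrow\quad
  \frac{t^2/2}{v(Z) + L t/3} \ge \frac{3 t}{8 L}.
\end{aligned}
```

Inserting these lower bounds on the exponent into (6.1.4) gives $(d_1 + d_2)\, e^{-3t^2/(8 v(Z))}$ in the first range and $(d_1 + d_2)\, e^{-3t/(8L)}$ in the second, so the crossover between the two regimes occurs at $t = v(Z)/L$.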

Last, it is helpful to remember that the matrix Bernstein inequality extends to a sum of
uncentered random matrices. In this case, the result describes the spectral-norm deviation of the
random sum from its mean value. For reference, we include the statement here.

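Corollary 6.1.2 (Matrix Bernstein: Uncentered Summands). Consider a finite sequence $\{S_k\}$ of
independent random matrices with common dimension $d_1 \times d_2$. Assume that each matrix has
uniformly bounded deviation from its mean:

$$
\| S_k - \mathbb{E}\, S_k \| \le L \quad\text{for each index } k.
$$

Introduce the sum

$$
Z = \sum_k S_k,
$$

and let $v(Z)$ denote the matrix variance statistic of the sum:

$$
\begin{aligned}
v(Z) &= \max\big\{ \| \mathbb{E}[ (Z - \mathbb{E} Z)(Z - \mathbb{E} Z)^* ] \|, \
                   \| \mathbb{E}[ (Z - \mathbb{E} Z)^* (Z - \mathbb{E} Z) ] \| \big\} \\
     &= \max\Big\{ \Big\| \sum_k \mathbb{E}[ (S_k - \mathbb{E} S_k)(S_k - \mathbb{E} S_k)^* ] \Big\|, \
                   \Big\| \sum_k \mathbb{E}[ (S_k - \mathbb{E} S_k)^* (S_k - \mathbb{E} S_k) ] \Big\| \Big\}.
\end{aligned}
$$

Then

$$
\mathbb{E}\, \| Z - \mathbb{E}\, Z \| \le \sqrt{2 v(Z) \log(d_1 + d_2)} + \frac{1}{3} L \log(d_1 + d_2).
$$

Furthermore, for all $t \ge 0$,

$$
\mathbb{P}\{ \| Z - \mathbb{E}\, Z \| \ge t \} \le (d_1 + d_2) \exp\Big( \frac{-t^2/2}{v(Z) + L t / 3} \Big).
$$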

This result follows as an immediate corollary of Theorem 6.1.1. 
Related Results

The bounds in Theorem 6.1.1 are stated in terms of the ambient dimensions $d_1$ and $d_2$ of the
random matrix $Z$. The dependence on the ambient dimension is not completely natural. For
example, consider embedding the random matrix $Z$ into the top corner of a much larger
matrix which is zero everywhere else. It turns out that we can achieve results that reflect only the
“intrinsic dimension” of $Z$. We turn to this analysis in Chapter 7.

In addition, there are many circumstances where the uniform upper bound $L$ that appears
in (6.1.3) does not accurately reflect the tail behavior of the random matrix. For instance, the
summands themselves may have very heavy tails. In such emergencies, the following
expectation bound [CGT12a, Thm. A.1] can be a lifesaver.

$$
\big( \mathbb{E}\, \|Z\|^2 \big)^{1/2} \le \sqrt{2e \, v(Z) \log(d_1 + d_2)}
  + 4e \, \big( \mathbb{E} \max\nolimits_k \|S_k\|^2 \big)^{1/2} \log(d_1 + d_2). \tag{6.1.6}
$$

This result is a matrix formulation of the Rosenthal-Pinelis inequality [Pin94, Thm. 4.1]. 

Finally, let us reiterate that there are other types of matrix Bernstein inequalities. For
example, we can sharpen the tail bound (6.1.4) to obtain a matrix Bennett inequality. We can also
relax the boundedness assumption to a weaker hypothesis on the growth of the moments of each
summand $S_k$. In the Hermitian setting, the result can also discriminate the behavior of the upper
and lower tails, which is a consequence of Theorem 6.6.1 below. See the notes at the end of this
chapter and the annotated bibliography for more information.

6.1.2 Optimality of the Matrix Bernstein Inequality 

To use the matrix Bernstein inequality, Theorem 6.1.1, and its relatives with intelligence, one 
must appreciate their strengths and weaknesses. We will focus on the matrix Rosenthal-Pinelis 
inequality (6.1.6). Nevertheless, similar insights are relevant to the estimate (6.1.3). 

The Expectation Bound 

Let us present lower bounds to demonstrate that the matrix Rosenthal-Pinelis inequality (6.1.6)
requires both terms that appear. First, the quantity $v(Z)$ cannot be omitted because Jensen’s
inequality implies that

$$
\mathbb{E}\, \|Z\|^2 = \mathbb{E} \max\{ \|Z Z^*\|, \ \|Z^* Z\| \}
  \ge \max\{ \|\mathbb{E}(Z Z^*)\|, \ \|\mathbb{E}(Z^* Z)\| \} = v(Z).
$$

Under a natural hypothesis, the second term on the right-hand side of (6.1.6) also is essential.
Suppose that each summand $S_k$ is a symmetric random variable; that is, $S_k$ and $-S_k$ have the
same distribution. In this case, an involved argument [LT91, Prop. 6.10] leads to the bound

$$
\mathbb{E}\, \|Z\|^2 \ge \mathrm{const} \cdot \mathbb{E} \max\nolimits_k \|S_k\|^2. \tag{6.1.7}
$$

There are examples where the right-hand side of (6.1.7) is comparable with the uniform upper 
bound L on the summands, but this is not always so. 

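In summary, when the summands $S_k$ are symmetric, we have matching estimates

$$
\begin{aligned}
\mathrm{const} \cdot \Big[ \sqrt{v(Z)} + \big( \mathbb{E} \max\nolimits_k \|S_k\|^2 \big)^{1/2} \Big]
  &\le \big( \mathbb{E}\, \|Z\|^2 \big)^{1/2} \\
  &\le \mathrm{Const} \cdot \Big[ \sqrt{v(Z) \log(d_1 + d_2)}
     + \big( \mathbb{E} \max\nolimits_k \|S_k\|^2 \big)^{1/2} \log(d_1 + d_2) \Big].
\end{aligned}
$$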

We see that the bound (6.1.6) must include some version of each term that appears, but the 
logarithms are not always necessary. 
Examples where the Logarithms Appear

First, let us show that the variance term in (6.1.6) must contain a logarithm. For each natural
number $n$, consider the $d \times d$ random matrix $Z_n$ of the form

$$
Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \sum_{k=1}^{d} \varrho_{ik} E_{kk}
$$

where $\{\varrho_{ik}\}$ is an independent family of Rademacher random variables. An easy
application of the bound (6.1.6) implies that

$$
\mathbb{E}\, \|Z_n\| \le \mathrm{Const} \cdot \Big[ \sqrt{\log(2d)} + \frac{1}{\sqrt{n}} \log(2d) \Big]
  \to \mathrm{Const} \cdot \sqrt{\log(2d)} \quad\text{as } n \to \infty.
$$

Using the central limit theorem and the Skorokhod representation, we can construct an
independent family $\{\gamma_k\}$ of standard normal random variables for which

$$
Z_n \to \sum_{k=1}^{d} \gamma_k E_{kk} \quad\text{almost surely as } n \to \infty.
$$

But this fact ensures that

$$
\mathbb{E}\, \|Z_n\| \to \mathbb{E}\, \Big\| \sum_{k=1}^{d} \gamma_k E_{kk} \Big\|
  = \mathbb{E} \max\nolimits_k |\gamma_k| \approx \sqrt{2 \log d} \quad\text{as } n \to \infty.
$$

Therefore, we cannot remove the logarithm from the variance term in (6.1.6).

Next, let us justify the logarithm on the norm of the summands in (6.1.6). For each natural
number $n$, consider a $d \times d$ random matrix $Z_n$ of the form

$$
Z_n = \sum_{i=1}^{n} \sum_{k=1}^{d} \big( \delta_{ik}^{(n)} - n^{-1} \big) E_{kk}
$$

where $\{\delta_{ik}^{(n)}\}$ is an independent family of Bernoulli($n^{-1}$) random variables. The
matrix Rosenthal-Pinelis inequality (6.1.6) ensures that

$$
\mathbb{E}\, \|Z_n\| \le \mathrm{Const} \cdot \Big[ \sqrt{\log(2d)} + \log(2d) \Big].
$$

Using the Poisson limit of a binomial random variable and the Skorokhod representation, we can
construct an independent family $\{Q_k\}$ of Poisson(1) random variables for which

$$
Z_n \to \sum_{k=1}^{d} (Q_k - 1) E_{kk} \quad\text{almost surely as } n \to \infty.
$$

Therefore,

$$
\mathbb{E}\, \|Z_n\| \to \mathbb{E}\, \Big\| \sum_{k=1}^{d} (Q_k - 1) E_{kk} \Big\|
  = \mathbb{E} \max\nolimits_k |Q_k - 1| \approx \mathrm{const} \cdot \frac{\log d}{\log \log d}
  \quad\text{as } n \to \infty.
$$

In short, the bound we derived from (6.1.6) requires the logarithm on the second term, but it is
suboptimal by a log log factor. The upper matrix Chernoff inequality (5.1.6) correctly predicts the
appearance of the iterated logarithm in this example, as does the matrix Bennett inequality.
The last two examples rely heavily on the commutativity of the summands as well as the
infinite divisibility of the normal and Poisson distributions. As a consequence, it may appear 
that the logarithms only appear in very special contexts. In fact, many (but not all!) examples 
that arise in practice do require the logarithms that appear in the matrix Bernstein inequality. 
It is a subject of ongoing research to obtain a simple criterion for deciding when the logarithms 
belong. 

6.2 Example: Matrix Approximation by Random Sampling 

In applied mathematics, we often need to approximate a complicated target object by a more
structured object. In some situations, we can solve this problem using a beautiful probabilistic
approach called empirical approximation. The basic idea is to construct a “simple” random
object whose expectation equals the target. We obtain the approximation by averaging several
independent copies of the simple random object. As the number of terms in this average
increases, the approximation becomes more complex, but it represents the target more faithfully.
The challenge is to quantify this tradeoff.

In particular, we often encounter problems where we need to approximate a matrix by a more 
structured matrix. For example, we may wish to find a sparse matrix that is close to a given 
matrix, or we may need to construct a low-rank matrix that is close to a given matrix. Empirical 
approximation provides a mechanism for obtaining these approximations. The matrix Bernstein 
inequality offers a natural tool for assessing the quality of the randomized approximation. 

This section develops a general framework for empirical approximation of matrices.
Subsequent sections explain how this technique applies to specific examples from the fields of
randomized linear algebra and machine learning.

6.2.1 Setup 

Let $B$ be a target matrix that we hope to approximate by a more structured matrix. To that end,
let us represent the target as a sum of “simple” matrices:

$$
B = \sum_{i=1}^{N} B_i. \tag{6.2.1}
$$

The idea is to identify summands with desirable properties that we want our approximation to 
inherit. The examples in this chapter depend on decompositions of the form (6.2.1). 

Along with the decomposition (6.2.1), we need a set of sampling probabilities:

$$
\sum_{i=1}^{N} p_i = 1 \quad\text{and}\quad p_i > 0 \quad\text{for } i = 1, \dots, N. \tag{6.2.2}
$$

We want to ascribe larger probabilities to “more important” summands. Quantifying what
“important” means is the most difficult aspect of randomized matrix approximation. Choosing the
right sampling distribution for a specific problem requires insight and ingenuity.

Given the data (6.2.1) and (6.2.2), we may construct a “simple” random matrix $R$ by sampling:

$$
R = p_i^{-1} B_i \quad\text{with probability } p_i. \tag{6.2.3}
$$

This construction ensures that R is an unbiased estimator of the target: E R = B. Even so, the 
random matrix R offers a poor approximation of the target B because it has a lot more structure. 
To improve the quality of the approximation, we average $n$ independent copies of the random
matrix $R$. We obtain an estimator of the form

$$
\bar{R}_n = \frac{1}{n} \sum_{k=1}^{n} R_k \quad\text{where each } R_k \text{ is an independent copy of } R.
$$

By linearity of expectation, this estimator is also unbiased: $\mathbb{E}\, \bar{R}_n = B$. The
approximation $\bar{R}_n$ remains structured when the number $n$ of terms in the approximation is
small as compared with the number $N$ of terms in the decomposition (6.2.1).

Our goal is to quantify the approximation error as a function of the complexity $n$ of the
approximation:

$$
\mathbb{E}\, \| \bar{R}_n - B \| \le \mathrm{error}(n).
$$


There is a tension between the total number n of terms in the approximation and the error 
error(n) the approximation incurs. In applications, it is essential to achieve the right balance. 
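The construction of Section 6.2.1 can be sketched in a few lines. The example below is only an illustration under stated assumptions: it splits a small matrix into its columns (one possible decomposition of the form (6.2.1)) and samples columns with probability proportional to their squared norms, which is a common choice rather than one prescribed here. The function names are hypothetical.

```python
import random

def column_terms(B):
    """Split B into terms B_i, where B_i keeps column i of B and is zero elsewhere."""
    rows, cols = len(B), len(B[0])
    return [[[B[r][c] if c == i else 0.0 for c in range(cols)] for r in range(rows)]
            for i in range(cols)]

def sampling_estimator(terms, probs, n, seed=0):
    """Average n independent copies of R, where R = B_i / p_i with probability p_i."""
    rng = random.Random(seed)
    rows, cols = len(terms[0]), len(terms[0][0])
    estimate = [[0.0] * cols for _ in range(rows)]
    picks = rng.choices(range(len(terms)), weights=probs, k=n)  # indices of sampled terms
    for i in picks:
        for r in range(rows):
            for c in range(cols):
                estimate[r][c] += terms[i][r][c] / (probs[i] * n)
    return estimate, picks

B = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0]]
terms = column_terms(B)
# Assumed sampling rule: probability proportional to the squared column norm.
weights = [sum(B[r][i] ** 2 for r in range(len(B))) for i in range(len(B[0]))]
probs = [w / sum(weights) for w in weights]
estimate, picks = sampling_estimator(terms, probs, n=2000, seed=7)
# Exact expectation of a single sample: sum_i p_i * (B_i / p_i), which reproduces B.
expectation = [[sum(probs[i] * (terms[i][r][c] / probs[i]) for i in range(len(terms)))
                for c in range(3)] for r in range(2)]
```

The estimator only ever contains columns that were actually sampled, which is the sense in which it stays structured when $n$ is small relative to $N$.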

6.2.2 Error Estimate for Matrix Sampling Estimators 

We can obtain an error estimate for the approximation scheme described in Section 6.2.1 as an 
immediate corollary of the matrix Bernstein inequality, Theorem 6.1.1. 

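Corollary 6.2.1 (Matrix Approximation by Random Sampling). Let $B$ be a fixed $d_1 \times d_2$ matrix.
Construct a $d_1 \times d_2$ random matrix $R$ that satisfies

$$
\mathbb{E}\, R = B \quad\text{and}\quad \|R\| \le L.
$$

Compute the per-sample second moment:

$$
m_2(R) = \max\big\{ \|\mathbb{E}(R R^*)\|, \ \|\mathbb{E}(R^* R)\| \big\}. \tag{6.2.4}
$$

Form the matrix sampling estimator

$$
\bar{R}_n = \frac{1}{n} \sum_{k=1}^{n} R_k \quad\text{where each } R_k \text{ is an independent copy of } R.
$$

Then the estimator satisfies

$$
\mathbb{E}\, \| \bar{R}_n - B \| \le \sqrt{ \frac{2 m_2(R) \log(d_1 + d_2)}{n} } + \frac{2 L \log(d_1 + d_2)}{3n}. \tag{6.2.5}
$$

Furthermore, for all $t \ge 0$,

$$
\mathbb{P}\{ \| \bar{R}_n - B \| \ge t \} \le (d_1 + d_2) \exp\Big( \frac{-n t^2 / 2}{m_2(R) + 2 L t / 3} \Big). \tag{6.2.6}
$$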


Proof. Since $R$ is an unbiased estimator of the target matrix $B$, we can write

$$
Z = \bar{R}_n - B = \frac{1}{n} \sum_{k=1}^{n} (R_k - \mathbb{E}\, R) = \sum_{k=1}^{n} S_k.
$$

We have defined the summands $S_k = n^{-1} (R_k - \mathbb{E}\, R)$. These random matrices form an
independent and identically distributed family, and each $S_k$ has mean zero.

Now, each of the summands is subject to an upper bound:

$$
\|S_k\| \le \frac{1}{n} \big( \|R_k\| + \|\mathbb{E}\, R\| \big)
  \le \frac{1}{n} \big( \|R_k\| + \mathbb{E}\, \|R\| \big) \le \frac{2L}{n}.
$$

The first relation is the triangle inequality; the second is Jensen’s inequality. The last estimate
follows from our assumption that $\|R\| \le L$.

To control the matrix variance statistic $v(Z)$, first note that 

$$v(Z) = \max\left\{ \left\| \sum_{k=1}^{n} \mathbb{E}(S_k S_k^*) \right\|, \ \left\| \sum_{k=1}^{n} \mathbb{E}(S_k^* S_k) \right\| \right\} = n \cdot \max\{ \|\mathbb{E}(S_1 S_1^*)\|, \ \|\mathbb{E}(S_1^* S_1)\| \}.$$

The first identity follows from the expression (6.1.2) for the matrix variance statistic, and the 
second holds because the summands $S_k$ are identically distributed. We may calculate that 

$$0 \preccurlyeq \mathbb{E}(S_1 S_1^*) = n^{-2}\, \mathbb{E}\left[ (R - \mathbb{E}\, R)(R - \mathbb{E}\, R)^* \right] = n^{-2} \left[ \mathbb{E}(R R^*) - (\mathbb{E}\, R)(\mathbb{E}\, R)^* \right] \preccurlyeq n^{-2}\, \mathbb{E}(R R^*).$$

The first relation holds because the expectation of the random positive-semidefinite matrix $S_1 S_1^*$ 
is positive semidefinite. The first identity follows from the definition of $S_1$ and the fact that $R_1$ 
has the same distribution as R. The second identity is a direct calculation. The last relation holds 
because $(\mathbb{E}\, R)(\mathbb{E}\, R)^*$ is positive semidefinite. As a consequence, 

$$\|\mathbb{E}(S_1 S_1^*)\| \le \frac{1}{n^2} \|\mathbb{E}(R R^*)\|.$$

Likewise, 

$$\|\mathbb{E}(S_1^* S_1)\| \le \frac{1}{n^2} \|\mathbb{E}(R^* R)\|.$$

In summary, 

$$v(Z) \le \frac{1}{n} \max\{ \|\mathbb{E}(R R^*)\|, \ \|\mathbb{E}(R^* R)\| \} = \frac{m_2(R)}{n}.$$

The last line follows from the definition (6.2.4) of $m_2(R)$. 

We are prepared to apply the matrix Bernstein inequality, Theorem 6.1.1, to the random 
matrix $Z = \sum_k S_k$. This operation results in the statement of the corollary. □ 
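
As a numerical companion to the corollary, the following Python sketch evaluates the two bounds for given parameters. It is illustrative only: the function names are hypothetical, the parameter values are made up, and the tail bound is written in the Bernstein form obtained by substituting the estimates $v(Z) \le m_2(R)/n$ and $\|S_k\| \le 2L/n$ from the proof.

```python
import math

def expected_error_bound(m2, L, d1, d2, n):
    """Evaluate the right-hand side of the expectation bound (6.2.5)."""
    log_dim = math.log(d1 + d2)
    return math.sqrt(2.0 * m2 * log_dim / n) + 2.0 * L * log_dim / (3.0 * n)

def tail_bound(m2, L, d1, d2, n, t):
    """Evaluate a Bernstein-type bound for P{ ||R_n - B|| >= t }.

    Assumed form: (d1 + d2) * exp(-(n t^2 / 2) / (m2 + 2 L t / 3)),
    capped at 1 because the quantity bounds a probability.
    """
    exponent = -(n * t * t / 2.0) / (m2 + 2.0 * L * t / 3.0)
    return min(1.0, (d1 + d2) * math.exp(exponent))

# Illustrative parameters (assumptions, not values from the text).
m2, L, d1, d2 = 1.0, 1.0, 50, 50
for n in (1_000, 4_000, 16_000):
    print(n, expected_error_bound(m2, L, d1, d2, n), tail_bound(m2, L, d1, d2, n, t=0.2))
```

Quadrupling n roughly halves the expectation bound once the square-root term dominates, which is the behavior discussed in the next section.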


6.2.3 Discussion 

One of the most common applications of the matrix Bernstein inequality is to analyze empirical 
matrix approximations. As a consequence, Corollary 6.2.1 is one of the most useful forms of the 
matrix Bernstein inequality. Let us discuss some of the important aspects of this result. 

Understanding the Bound on the Approximation Error 

First, let us examine how many samples n suffice to bring the approximation error bound in 
Corollary 6.2.1 below a specified positive tolerance $\varepsilon$. Examining inequality (6.2.5), we find that 

$$n \ge \frac{2 m_2(R) \log(d_1 + d_2)}{\varepsilon^2} + \frac{2 L \log(d_1 + d_2)}{3 \varepsilon} \quad \text{implies} \quad \mathbb{E}\, \|R_n - B\| \le 2\varepsilon. \tag{6.2.7}$$

Roughly, the number n of samples should be on the scale of the per-sample second moment 
$m_2(R)$ and the uniform upper bound L. 
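
The sample-size rule can be turned into a small calculator. The sketch below is a hypothetical helper, not part of the original text; it computes the smallest integer n that satisfies the sufficient condition and then confirms that the expectation bound of Corollary 6.2.1 falls below twice the tolerance. All numerical inputs are invented for illustration.

```python
import math

def samples_needed(m2, L, d1, d2, eps):
    """Smallest integer n satisfying the sufficient condition (6.2.7)."""
    log_dim = math.log(d1 + d2)
    return math.ceil(2.0 * m2 * log_dim / eps**2 + 2.0 * L * log_dim / (3.0 * eps))

def expected_error_bound(m2, L, d1, d2, n):
    """Right-hand side of the expectation bound in Corollary 6.2.1."""
    log_dim = math.log(d1 + d2)
    return math.sqrt(2.0 * m2 * log_dim / n) + 2.0 * L * log_dim / (3.0 * n)

# Illustrative inputs (assumptions).
m2, L, d1, d2 = 4.0, 3.0, 100, 80
for eps in (0.1, 0.05):
    n = samples_needed(m2, L, d1, d2, eps)
    print(eps, n, expected_error_bound(m2, L, d1, d2, n))
```

Halving the tolerance multiplies the required number of samples by almost four, because the term proportional to $\varepsilon^{-2}$ dominates.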

The bound (6.2.7) also reveals an unfortunate aspect of empirical matrix approximation. To 
make the tolerance $\varepsilon$ small, the number n of samples must increase in proportion to $\varepsilon^{-2}$. In 
other words, it takes many samples to achieve a highly accurate approximation. We cannot avoid 
this phenomenon, which ultimately is a consequence of the central limit theorem. 

On a more positive note, it is quite valuable that the error bounds (6.2.5) and (6.2.6) involve 
the spectral norm. This type of estimate simultaneously controls the error in every linear 
function of the approximation: 

$$\|R_n - B\| \le \varepsilon \quad \text{implies} \quad |\operatorname{tr}(R_n C) - \operatorname{tr}(B C)| \le \varepsilon \quad \text{when } \|C\|_{S_1} \le 1.$$

The Schatten 1-norm $\|\cdot\|_{S_1}$ is defined in (2.1.29). These bounds also control the error in each 
singular value $\sigma_j(R_n)$ of the approximation: 

$$\|R_n - B\| \le \varepsilon \quad \text{implies} \quad |\sigma_j(R_n) - \sigma_j(B)| \le \varepsilon \quad \text{for each } j = 1, 2, 3, \dots, \min\{d_1, d_2\}.$$

When there is a gap between two singular values of B, we can also obtain bounds for the 
discrepancy between the associated singular vectors of $R_n$ and B using perturbation theory. 
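
Both implications are easy to check numerically. In the NumPy sketch below, a small random perturbation of B stands in for the approximation (an assumption made purely for illustration; it is not a sampling estimator), and we compare the spectral-norm error with the largest change in any singular value and with the change in one linear functional whose coefficient matrix has unit Schatten 1-norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 6, 4

B = rng.standard_normal((d1, d2))
A = B + 1e-2 * rng.standard_normal((d1, d2))   # stand-in for the approximation
eps = np.linalg.norm(A - B, 2)                 # spectral-norm error

# Largest change in any singular value.
sv_gap = np.max(np.abs(np.linalg.svd(A, compute_uv=False)
                       - np.linalg.svd(B, compute_uv=False)))

# A linear functional tr(. C) with unit Schatten 1-norm (nuclear norm).
C = rng.standard_normal((d2, d1))
C /= np.linalg.norm(C, 'nuc')
trace_gap = abs(np.trace(A @ C) - np.trace(B @ C))

print(f"eps = {eps:.3e}, singular value gap = {sv_gap:.3e}, trace gap = {trace_gap:.3e}")
```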

To construct a good sampling estimator R, we ought to control both $m_2(R)$ and L. In 
practice, this demands considerable creativity. This observation hints at the possibility of achieving 
a bias-variance tradeoff when approximating B. To do so, we can drop all of the “unimportant” 
terms in the representation (6.2.1), i.e., those whose sampling probabilities are small. Then we 
construct a random approximation R only for the “important” terms that remain. Properly 
executed, this process may decrease both the per-sample second moment $m_2(R)$ and the upper 
bound L. The idea is analogous to shrinkage in statistical estimation. 


A General Sampling Model 

Corollary 6.2.1 extends beyond the sampling model based on the finite expansion (6.2.1). Indeed, 
we can consider a more general decomposition of the target matrix B: 

$$B = \int_{\Omega} B(\omega) \, d\mu(\omega)$$

where $\mu$ is a probability measure on a sample space $\Omega$. As before, the idea is to represent the 
target matrix B as an average of “simple” matrices $B(\omega)$. The main difference is that the family 
of simple matrices may now be infinite. In this setting, we construct the random approximation 
R so that 

$$\mathbb{P}\{ R \in E \} = \mu\{ \omega : B(\omega) \in E \} \quad \text{for } E \subset \mathbb{M}^{d_1 \times d_2}.$$

In particular, it follows that 

$$\mathbb{E}\, R = B \quad \text{and} \quad \|R\| \le \sup_{\omega \in \Omega} \|B(\omega)\|.$$


As we will discuss, this abstraction is important for applications in machine learning. 


Suboptimality of Sampling Estimators 

Another fundamental point about sampling estimators is that they are usually suboptimal. In 
other words, the matrix sampling estimator may incur an error substantially worse than the error 
in the best structured approximation of the target matrix. 

To see why, let us consider a simple form of low-rank approximation by random sampling. 
The method here does not have practical value, but it highlights the reason that sampling 
estimators usually do not achieve ideal results. Suppose that B has singular value decomposition 

$$B = \sum_{j=1}^{N} \sigma_j u_j v_j^* \quad \text{where} \quad \sum_{j=1}^{N} \sigma_j = 1 \quad \text{and} \quad N = \min\{d_1, d_2\}.$$

Given the SVD, we can construct a random rank-one approximation R of the form 

$$R = u_j v_j^* \quad \text{with probability } \sigma_j.$$

Per Corollary 6.2.1, the error in the associated sampling estimator $R_n$ of B satisfies 

$$\mathbb{E}\, \|R_n - B\| \le \sqrt{\frac{2 \log(d_1 + d_2)}{n}} + \frac{2 \log(d_1 + d_2)}{3n}.$$

On the other hand, a best rank-n approximation of B takes the form $B_n = \sum_{j=1}^{n} \sigma_j u_j v_j^*$, and it 
incurs error 

$$\|B_n - B\| = \sigma_{n+1} \le \frac{1}{n+1}.$$

The second relation is Markov’s inequality, which provides an accurate estimate only when the 
singular values $\sigma_1, \dots, \sigma_{n+1}$ are comparable. In that case, the sampling estimator arrives within a 
logarithmic factor of the optimal error. But there are many matrices whose singular values decay 
quickly, so that $\sigma_{n+1} \ll (n+1)^{-1}$. In the latter situation, the error in the sampling estimator is 
much worse than the optimal error. 
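
The gap is easy to observe in a simulation. The following NumPy sketch builds a matrix whose singular values decay geometrically (an assumed test case, chosen only to make the effect visible), forms the sampling estimator described above, and compares its average spectral-norm error with the error $\sigma_{n+1}$ of the best rank-n approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 30, 5

# Random orthonormal singular vectors and geometrically decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma = 2.0 ** -np.arange(1, d + 1)
sigma /= sigma.sum()                      # singular values sum to one
B = (U * sigma) @ V.T

def sampling_estimator(n_samples):
    """Average of n_samples copies of R = u_j v_j^*, drawn with probability sigma_j."""
    idx = rng.choice(d, size=n_samples, p=sigma)
    counts = np.bincount(idx, minlength=d)
    return (U * (counts / n_samples)) @ V.T

trials = 200
sampling_err = np.mean([np.linalg.norm(sampling_estimator(n) - B, 2)
                        for _ in range(trials)])
optimal_err = sigma[n]                    # sigma_{n+1} with zero-based indexing

print(f"sampling error ~ {sampling_err:.4f}, best rank-{n} error = {optimal_err:.4f}")
```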


Warning: Frobenius-Norm Bounds 

We often encounter papers that develop Frobenius-norm error bounds for matrix 
approximations, perhaps because the analysis is more elementary. But one must recognize that 
Frobenius-norm error bounds are not acceptable in most cases of practical interest: 

Frobenius-norm error bounds are typically vacuous. 

In particular, this phenomenon occurs in data analysis whenever we try to approximate a matrix 
that contains white or pink noise. 

To illustrate this point, let us consider the ubiquitous problem of approximating a low-rank 
matrix corrupted by additive white Gaussian noise: 

$$B = x x^* + \alpha E \in \mathbb{M}_d \quad \text{where} \quad \|x\|^2 = 1. \tag{6.2.8}$$

The desired approximation of the matrix B is the rank-one matrix $B_{\mathrm{opt}} = x x^*$. For modeling 
purposes, we assume that E has independent normal$(0, d^{-1})$ entries. As a consequence, 

$$\|E\| \approx 2 \quad \text{and} \quad \|E\|_F \approx \sqrt{d}.$$

Now, the spectral-norm error in the desired approximation satisfies 

$$\|B_{\mathrm{opt}} - B\| = \alpha \|E\| \approx 2\alpha.$$

On the other hand, the Frobenius-norm error in the desired approximation satisfies 

$$\|B_{\mathrm{opt}} - B\|_F = \alpha \|E\|_F \approx \alpha \sqrt{d}.$$

We see that the Frobenius-norm error can be quite large, even when we find the required 
approximation. 

Here is another way to look at the same fact. Suppose we construct an approximation $\widehat{B}$ of 
the matrix B from (6.2.8) whose Frobenius-norm error is comparable with the optimal error: 

$$\|\widehat{B} - B\|_F \le \varepsilon \sqrt{d}.$$

There is no reason for the approximation $\widehat{B}$ to have any relationship with the desired 
approximation $B_{\mathrm{opt}}$. For example, the approximation $\widehat{B} = \alpha E$ satisfies this error bound with 
$\varepsilon = d^{-1/2}$ even though $\widehat{B}$ consists only of noise. 
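
A short experiment makes the warning concrete. The NumPy sketch below draws one instance of the model (6.2.8) with illustrative values of the dimension and the noise level (both are assumptions), and compares the rank-one signal with the pure-noise matrix as approximations of B in both norms.

```python
import numpy as np

rng = np.random.default_rng(2)
d, alpha = 400, 0.1                       # illustrative dimension and noise level

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # unit vector
E = rng.standard_normal((d, d)) / np.sqrt(d)   # independent normal(0, 1/d) entries

B = np.outer(x, x) + alpha * E
B_opt = np.outer(x, x)                    # the desired rank-one approximation
noise_only = alpha * E                    # an approximation containing no signal

spec_E = np.linalg.norm(E, 2)
frob_E = np.linalg.norm(E, 'fro')

frob_err_opt = np.linalg.norm(B_opt - B, 'fro')
frob_err_noise = np.linalg.norm(noise_only - B, 'fro')
spec_err_opt = np.linalg.norm(B_opt - B, 2)
spec_err_noise = np.linalg.norm(noise_only - B, 2)

print(f"||E|| = {spec_E:.3f}, ||E||_F = {frob_E:.3f}, sqrt(d) = {np.sqrt(d):.3f}")
print(f"Frobenius error: signal {frob_err_opt:.3f}, noise only {frob_err_noise:.3f}")
print(f"spectral error:  signal {spec_err_opt:.3f}, noise only {spec_err_noise:.3f}")
```

With these parameters the pure-noise matrix has the smaller Frobenius-norm error, while the spectral norm ranks the two candidates in the sensible order.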

6.3 Application: Randomized Sparsification of a Matrix 

Many tasks in data analysis involve large, dense matrices that contain a lot of redundant 
information. For example, an experiment that tabulates many variables about a large number of 
subjects typically results in a low-rank data matrix because subjects are often similar to each 
other. Many questions that we pose about these data matrices can be addressed by spectral 
computations. In particular, factor analysis involves a singular value decomposition. 

When the data matrix is approximately low rank, it has fewer degrees of freedom than its 
ambient dimension. Therefore, we can construct a simpler approximation that still captures 
most of the information in the matrix. One method for finding this approximation is to replace 
the dense target matrix by a sparse matrix that is close in spectral-norm distance. An elegant 
way to identify this sparse proxy is to randomly select a small number of entries from the original 
matrix to retain. This is a type of empirical approximation. 

Sparsification has several potential advantages. First, it is considerably less expensive to store 
a sparse matrix than a dense matrix. Second, many algorithms for spectral computation operate 
more efficiently on sparse matrices. 

In this section, we examine a very recent approach to randomized sparsification due to Kundu 
& Drineas [KD14]. The analysis is an immediate consequence of Corollary 6.2.1. See the notes at 
the end of the chapter for history and references. 

6.3.1 Problem Formulation & Randomized Algorithm 

Let B be a fixed $d_1 \times d_2$ complex matrix. The sparsification problem requires us to find a sparse 
matrix $\widehat{B}$ that has small distance from B with respect to the spectral norm. We can achieve this 
goal using an empirical approximation strategy. 

First, let us express the target matrix as a sum of its entries: 

$$B = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} b_{ij} E_{ij}.$$

Introduce sampling probabilities 

$$p_{ij} = \frac{1}{2} \left[ \frac{|b_{ij}|^2}{\|B\|_F^2} + \frac{|b_{ij}|}{\|B\|_{\ell_1}} \right] \quad \text{for } i = 1, \dots, d_1 \text{ and } j = 1, \dots, d_2. \tag{6.3.1}$$

The Frobenius norm is defined in (2.1.2), and the entrywise $\ell_1$ norm is defined in (2.1.30). It is 
easy to check that the numbers $p_{ij}$ form a probability distribution. Let us emphasize that the 
non-obvious form of the distribution (6.3.1) represents a decade of research. 

Now, we introduce a $d_1 \times d_2$ random matrix R that has exactly one nonzero entry: 

$$R = \frac{1}{p_{ij}} \cdot b_{ij} E_{ij} \quad \text{with probability } p_{ij}.$$


We use the convention that 0/0 = 0 so that we do not need to treat zero entries separately. It is 
immediate that 

$$\mathbb{E}\, R = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \frac{1}{p_{ij}} \cdot b_{ij} E_{ij} \cdot p_{ij} = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} b_{ij} E_{ij} = B.$$

Therefore, R is an unbiased estimate of B. 

Although the expectation of R is correct, its variance is quite high. Indeed, R has only one 
nonzero entry, while B typically has many nonzero entries. To reduce the variance, we combine 
several independent copies of the simple estimator: 

$$R_n = \frac{1}{n} \sum_{k=1}^{n} R_k \quad \text{where each $R_k$ is an independent copy of $R$.}$$

The matrix $R_n$ has at most n nonzero entries and, by linearity of expectation, $\mathbb{E}\, R_n = B$, so 
it also provides an unbiased estimate of the target. The challenge is to quantify the error 
$\|R_n - B\|$ as a function of the sparsity level n. 
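
The sampling scheme translates into a few lines of NumPy. The sketch below is one possible implementation under stated assumptions: the function names are hypothetical, the matrix is stored densely for simplicity (a practical code would return a sparse format), and the test matrix is random.

```python
import numpy as np

def sampling_probabilities(B):
    """Entrywise probabilities: average of the squared-magnitude and magnitude weights."""
    absB = np.abs(B)
    return 0.5 * (absB**2 / np.sum(absB**2) + absB / np.sum(absB))

def sparsify(B, n, rng):
    """Average of n independent single-entry estimators; at most n nonzero entries."""
    P = sampling_probabilities(B)
    flat = rng.choice(B.size, size=n, p=P.ravel())
    counts = np.bincount(flat, minlength=B.size).reshape(B.shape)
    R_n = np.zeros(B.shape, dtype=np.result_type(B.dtype, np.float64))
    hit = counts > 0                       # entries with p = 0 are never drawn
    R_n[hit] = counts[hit] * B[hit] / (n * P[hit])
    return R_n

rng = np.random.default_rng(3)
B = rng.standard_normal((20, 15))
for n in (50, 5_000, 200_000):
    R_n = sparsify(B, n, rng)
    rel_err = np.linalg.norm(R_n - B, 2) / np.linalg.norm(B, 2)
    print(n, np.count_nonzero(R_n), f"{rel_err:.3f}")
```

A dense random matrix has high stable rank, so it is a poor candidate for sparsification; the loop is only meant to show the error shrinking as n grows.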


6.3.2 Performance of Randomized Sparsification 

The randomized sparsification method is clearly a type of empirical approximation, so we can 
use Corollary 6.2.1 to perform the analysis. We will establish the following error bound. 


$$\mathbb{E}\, \|R_n - B\| \le \sqrt{\frac{4 \|B\|_F^2 \cdot \max\{d_1, d_2\} \log(d_1 + d_2)}{n}} + \frac{4 \|B\|_{\ell_1} \log(d_1 + d_2)}{3n}. \tag{6.3.2}$$

The short proof of (6.3.2) appears below in Section 6.3.3. 

Let us explain the content of the estimate (6.3.2). First, the bound (2.1.31) allows us to replace 
the $\ell_1$ norm by the Frobenius norm: 

$$\|B\|_{\ell_1} \le \sqrt{d_1 d_2} \cdot \|B\|_F \le \max\{d_1, d_2\} \cdot \|B\|_F.$$

Placing the error (6.3.2) on a relative scale, we see that 

$$\frac{\mathbb{E}\, \|R_n - B\|}{\|B\|} \le \frac{\|B\|_F}{\|B\|} \cdot \left[ \sqrt{\frac{4 \max\{d_1, d_2\} \log(d_1 + d_2)}{n}} + \frac{4 \max\{d_1, d_2\} \log(d_1 + d_2)}{3n} \right].$$

The stable rank srank(B), defined in (2.1.25), emerges naturally as a quantity of interest because 
$\|B\|_F / \|B\| = \sqrt{\operatorname{srank}(B)}$. 

Now, suppose that the sparsity level n satisfies 

$$n \ge \varepsilon^{-2} \cdot \operatorname{srank}(B) \cdot \max\{d_1, d_2\} \log(d_1 + d_2)$$

where the tolerance $\varepsilon \in (0, 1]$. We determine that 

$$\frac{\mathbb{E}\, \|R_n - B\|}{\|B\|} \le 2\varepsilon + \frac{4}{3} \cdot \frac{\varepsilon^2}{\sqrt{\operatorname{srank}(B)}}.$$

Since the stable rank is always at least one and we have assumed that $\varepsilon \le 1$, this estimate implies 
that 

$$\frac{\mathbb{E}\, \|R_n - B\|}{\|B\|} \le 4\varepsilon.$$

We discover that it is possible to replace the matrix B by a matrix with at most n nonzero entries 
while achieving a small relative error in the spectral norm. When $\operatorname{srank}(B) \ll \min\{d_1, d_2\}$, we 
can achieve a dramatic reduction in the number of nonzero entries needed to carry the spectral 
information in the matrix B. 


6.3.3 Analysis of Randomized Sparsification 

Let us proceed with the analysis of randomized sparsification. To apply Corollary 6.2.1, we need 
to obtain bounds for the per-sample second moment $m_2(R)$ and the uniform upper bound L. The key 
to both calculations is to obtain appropriate lower bounds on the sampling probabilities $p_{ij}$. 
Indeed, 

$$p_{ij} \ge \frac{1}{2} \cdot \frac{|b_{ij}|}{\|B\|_{\ell_1}} \quad \text{and} \quad p_{ij} \ge \frac{1}{2} \cdot \frac{|b_{ij}|^2}{\|B\|_F^2}. \tag{6.3.3}$$

Each estimate follows by neglecting one term in (6.3.1). 

First, we turn to the uniform bound on the random matrix R. We have 

$$\|R\| \le \max_{ij} \| p_{ij}^{-1} b_{ij} E_{ij} \| = \max_{ij} \frac{1}{p_{ij}} \cdot |b_{ij}| \le 2 \|B\|_{\ell_1}.$$

The last inequality depends on the first bound in (6.3.3). Therefore, we may take $L = 2 \|B\|_{\ell_1}$. 
Second, we turn to the computation of the per-sample second moment $m_2(R)$. We have 

$$\mathbb{E}(R R^*) = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \frac{1}{p_{ij}^2} \cdot (b_{ij} E_{ij})(b_{ij} E_{ij})^* \, p_{ij} = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \frac{|b_{ij}|^2}{p_{ij}} \cdot E_{ii} \preccurlyeq 2 \|B\|_F^2 \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} E_{ii} = 2 d_2 \|B\|_F^2 \cdot I_{d_1}.$$

The semidefinite inequality holds because each matrix $|b_{ij}|^2 E_{ii}$ is positive semidefinite and 
because of the second bound in (6.3.3). Similarly, 

$$\mathbb{E}(R^* R) \preccurlyeq 2 d_1 \|B\|_F^2 \cdot I_{d_2}.$$

In summary, 

$$m_2(R) = \max\{ \|\mathbb{E}(R R^*)\|, \ \|\mathbb{E}(R^* R)\| \} \le 2 \max\{d_1, d_2\} \cdot \|B\|_F^2.$$

This is the required estimate for the per-sample second moment. 

Finally, to reach the advertised error bound (6.3.2), we invoke Corollary 6.2.1 with the 
parameters $L = 2 \|B\|_{\ell_1}$ and $m_2(R) \le 2 \max\{d_1, d_2\} \cdot \|B\|_F^2$. 
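
Because the single-entry estimator takes only finitely many values, both parameters can be computed exactly and compared with the estimates from the analysis. The NumPy sketch below does so for a random test matrix (an assumption). It uses the fact that $\mathbb{E}(R R^*)$ and $\mathbb{E}(R^* R)$ are diagonal, so their norms are the largest row sum and column sum of the array $|b_{ij}|^2 / p_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((8, 5))           # illustrative test matrix, no zero entries
d1, d2 = B.shape

absB = np.abs(B)
frob_sq = np.sum(absB**2)                 # squared Frobenius norm
l1 = np.sum(absB)                         # entrywise l1 norm
P = 0.5 * (absB**2 / frob_sq + absB / l1) # sampling probabilities

# Exact uniform bound: the largest possible value of ||R||.
L_exact = np.max(absB / P)

# Exact per-sample second moment from the diagonal matrices E(RR^*) and E(R^*R).
ratio = absB**2 / P
m2_exact = max(ratio.sum(axis=1).max(), ratio.sum(axis=0).max())

print(f"L:  exact {L_exact:.3f}  <=  bound {2 * l1:.3f}")
print(f"m2: exact {m2_exact:.3f}  <=  bound {2 * max(d1, d2) * frob_sq:.3f}")
```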


6.4 Application: Randomized Matrix Multiplication 

Numerical linear algebra (NLA) is a well-established and important part of computer science. 
Some of the basic problems in this area include multiplying matrices, solving linear systems, 
computing eigenvalues and eigenvectors, and solving linear least-squares problems. 
Historically, the NLA community has focused on developing highly accurate deterministic methods 
that require as few floating-point operations as possible. Unfortunately, contemporary 
applications can strain standard NLA methods because problems have continued to become larger. 
Furthermore, on modern computer architectures, computational costs depend heavily on 
communication and other resources that the standard algorithms do not manage very well. 

In response to these challenges, researchers have started to develop randomized algorithms 
for core problems in NLA. In contrast to the classical algorithms, these new methods make 
random choices during execution to achieve computational efficiencies. These randomized 
algorithms can also be useful for large problems or for modern computer architectures. On the other 
hand, randomized methods can fail with some probability, and in some cases they are less 
accurate than their classical competitors. 

Matrix concentration inequalities are one of the key tools used to design and analyze 
randomized algorithms for NLA problems. In this section, we will describe a randomized method 
for matrix multiplication developed by Magen & Zouzias [MZ11, Zou13]. We will analyze this 
algorithm using Corollary 6.2.1. Turn to the notes at the end of the chapter for more information 
about the history. 


6.4.1 Problem Formulation & Randomized Algorithm 

One of the basic tasks in numerical linear algebra is to multiply two matrices with compatible 
dimensions. Suppose that B is a $d_1 \times N$ complex matrix and that C is an $N \times d_2$ complex matrix, 
and we wish to compute the product BC. The straightforward algorithm forms the product entry 
by entry: 

$$(BC)_{ik} = \sum_{j=1}^{N} b_{ij} c_{jk} \quad \text{for each } i = 1, \dots, d_1 \text{ and } k = 1, \dots, d_2. \tag{6.4.1}$$

This approach takes $O(N \cdot d_1 d_2)$ arithmetic operations. There are algorithms, such as Strassen’s 
divide-and-conquer method, that can reduce the cost, but these approaches are not considered 
practical for most applications. 

Suppose that the inner dimension N is substantially larger than the outer dimensions $d_1$ and 
$d_2$. In this setting, both matrices B and C are rank-deficient, so the columns of B contain a lot of 
linear dependencies, as do the rows of C. As a consequence, a random sample of columns from 
B (or rows from C) can be used as a proxy for the full matrix. Formally, the key to this approach 
is to view the matrix product as a sum of outer products: 

$$BC = \sum_{j=1}^{N} b_{:j} c_{j:}. \tag{6.4.2}$$

As usual, $b_{:j}$ denotes the jth column of B, while $c_{j:}$ denotes the jth row of C. We can approximate 
this sum using the empirical method. 

To develop an algorithm, the first step is to construct a simple random matrix R that provides 
an unbiased estimate for the matrix product. To that end, we pick a random index and form a 
rank-one matrix from the associated column of B and row of C. More precisely, define 

$$p_j = \frac{\|b_{:j}\|^2 + \|c_{j:}\|^2}{\|B\|_F^2 + \|C\|_F^2} \quad \text{for } j = 1, 2, 3, \dots, N. \tag{6.4.3}$$

The Frobenius norm is defined in (2.1.2). Using the properties of the norms, we can easily check 
that $(p_1, p_2, p_3, \dots, p_N)$ forms a bona fide probability distribution. The cost of computing these 
probabilities is at most $O(N \cdot (d_1 + d_2))$ arithmetic operations, which is much smaller than the 
cost of forming the product BC when $d_1$ and $d_2$ are large. 

We now define a $d_1 \times d_2$ random matrix R by the expression 

$$R = \frac{1}{p_j} \cdot b_{:j} c_{j:} \quad \text{with probability } p_j.$$

We use the convention that 0/0 = 0 so we do not have to treat zero rows and columns separately. 
It is straightforward to compute the expectation of R: 

$$\mathbb{E}\, R = \sum_{j=1}^{N} \frac{1}{p_j} \cdot b_{:j} c_{j:} \cdot p_j = \sum_{j=1}^{N} b_{:j} c_{j:} = BC.$$

As required, R is an unbiased estimator for the product BC. 

Although the expectation of R is correct, its variance is quite high. Indeed, R has rank one, 
while the rank of BC is usually larger! To reduce the variance, we combine several independent 
copies of the simple estimator: 

$$R_n = \frac{1}{n} \sum_{k=1}^{n} R_k \quad \text{where each $R_k$ is an independent copy of $R$.} \tag{6.4.4}$$

By linearity of expectation, $\mathbb{E}\, R_n = BC$, so we imagine that $R_n$ approximates the product well. 

To see whether this heuristic holds true, we need to understand how the error $\mathbb{E}\, \|R_n - BC\|$ 
depends on the number n of samples. It costs $O(n \cdot d_1 d_2)$ floating-point operations to 
determine all the entries of $R_n$. Therefore, when the number n of samples is much smaller than the 
inner dimension N of the matrices, we can achieve significant economies over the naive matrix 
multiplication algorithm. 
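
The procedure can be sketched in NumPy as follows. The function names are hypothetical and the test matrices are random (both are assumptions made for illustration); the sketch samples n column/row pairs according to (6.4.3) and forms the averaged estimator with a single matrix product of inner dimension n.

```python
import numpy as np

def column_row_probabilities(B, C):
    """Sampling probabilities built from squared column norms of B and row norms of C."""
    col_sq = np.sum(np.abs(B)**2, axis=0)
    row_sq = np.sum(np.abs(C)**2, axis=1)
    return (col_sq + row_sq) / (col_sq.sum() + row_sq.sum())

def randomized_matmul(B, C, n, rng):
    """Average of n independent rank-one estimators of the product B @ C."""
    p = column_row_probabilities(B, C)
    idx = rng.choice(B.shape[1], size=n, p=p)
    scale = 1.0 / (n * p[idx])            # sampled indices always have p > 0
    return (B[:, idx] * scale) @ C[idx, :]

rng = np.random.default_rng(5)
B = rng.standard_normal((10, 500))
C = rng.standard_normal((500, 8))
exact = B @ C
scale_BC = np.linalg.norm(B, 2) * np.linalg.norm(C, 2)
for n in (200, 2_000, 20_000):
    err = np.linalg.norm(randomized_matmul(B, C, n, rng) - exact, 2)
    print(n, f"{err / scale_BC:.4f}")
```

The random factors used here have stable rank close to the outer dimensions, so the number of samples needed for a given accuracy is governed by those dimensions rather than by N.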

In fact, it requires no computation beyond sampling the row/column indices to express $R_n$ in 
the form (6.4.4). This approach gives an inexpensive way to represent the product approximately. 

6.4.2 Performance of Randomized Matrix Multiplication 

To simplify our presentation, we will assume that both matrices have been scaled so that their 
spectral norms are equal to one: 

$$\|B\| = \|C\| = 1.$$

It is relatively inexpensive to compute the spectral norm of a matrix accurately, so this 
preprocessing step is reasonable. 

Let $\mathrm{asr} = \frac{1}{2} (\operatorname{srank}(B) + \operatorname{srank}(C))$ be the average stable rank of the two factors; see (2.1.25) for 
the definition of the stable rank. In §6.4.3, we will prove that 

$$\mathbb{E}\, \|R_n - BC\| \le \sqrt{\frac{4 \cdot \mathrm{asr} \cdot \log(d_1 + d_2)}{n}} + \frac{2 \cdot \mathrm{asr} \cdot \log(d_1 + d_2)}{3n}. \tag{6.4.5}$$

To appreciate what this estimate means, suppose that the number n of samples satisfies 

$$n \ge \varepsilon^{-2} \cdot \mathrm{asr} \cdot \log(d_1 + d_2)$$


where $\varepsilon$ is a positive tolerance. Then we obtain a relative error bound for the randomized matrix 
multiplication method 

$$\frac{\mathbb{E}\, \|R_n - BC\|}{\|B\| \, \|C\|} \le 2\varepsilon + \frac{2}{3} \varepsilon^2.$$

This expression depends on the normalization of B and C. The computational cost of forming 
the approximation is 

$$O(\varepsilon^{-2} \cdot \mathrm{asr} \cdot d_1 d_2 \log(d_1 + d_2)) \quad \text{arithmetic operations.}$$

In other words, when the average stable rank asr is substantially smaller than the inner 
dimension N of the two matrices B and C, the random estimate $R_n$ for the product BC achieves a small 
error relative to the scale of the factors. 


6.4.3 Analysis of Randomized Matrix Multiplication 

The randomized matrix multiplication method is just a specific example of empirical 
approximation, and the error bound (6.4.5) is an immediate consequence of Corollary 6.2.1. 

To pursue this approach, we need to establish a uniform bound on the norm of the estimator 
R for the product. Observe that 

$$\|R\| \le \max_j \| p_j^{-1} b_{:j} c_{j:} \| = \max_j \frac{\|b_{:j}\| \, \|c_{j:}\|}{p_j}.$$

To obtain a bound, recall the value (6.4.3) of the probability $p_j$, and invoke the inequality 
between geometric and arithmetic means: 

$$\|R\| \le \left( \|B\|_F^2 + \|C\|_F^2 \right) \cdot \max_j \frac{\|b_{:j}\| \, \|c_{j:}\|}{\|b_{:j}\|^2 + \|c_{j:}\|^2} \le \frac{1}{2} \left( \|B\|_F^2 + \|C\|_F^2 \right).$$

Since the matrices B and C have unit spectral norm, we can express this inequality in terms of 
the average stable rank: 

$$\|R\| \le \frac{1}{2} \left( \operatorname{srank}(B) + \operatorname{srank}(C) \right) = \mathrm{asr}.$$

This is exactly the kind of bound that we need. 

Next, we need an estimate for the per-sample second moment $m_2(R)$. By direct calculation, 

$$\mathbb{E}(R R^*) = \sum_{j=1}^{N} \frac{1}{p_j^2} \cdot (b_{:j} c_{j:})(b_{:j} c_{j:})^* \cdot p_j = \left( \|B\|_F^2 + \|C\|_F^2 \right) \cdot \sum_{j=1}^{N} \frac{\|c_{j:}\|^2}{\|b_{:j}\|^2 + \|c_{j:}\|^2} \, b_{:j} b_{:j}^* \preccurlyeq \left( \|B\|_F^2 + \|C\|_F^2 \right) \cdot B B^*.$$

The semidefinite relation holds because each fraction lies between zero and one, and each 
matrix $b_{:j} b_{:j}^*$ is positive semidefinite. Therefore, increasing the fraction to one only increases the 
matrix in the semidefinite order. Similarly, 

$$\mathbb{E}(R^* R) \preccurlyeq \left( \|B\|_F^2 + \|C\|_F^2 \right) \cdot C^* C.$$


In summary, 

$$\begin{aligned}
m_2(R) &= \max\{ \|\mathbb{E}(R R^*)\|, \ \|\mathbb{E}(R^* R)\| \} \\
&\le \left( \|B\|_F^2 + \|C\|_F^2 \right) \cdot \max\{ \|B B^*\|, \ \|C^* C\| \} \\
&= \|B\|_F^2 + \|C\|_F^2 \\
&= 2 \cdot \mathrm{asr}.
\end{aligned}$$

The penultimate line depends on the identity (2.1.24) and our assumption that both matrices B 
and C have norm one. 

Finally, to reach the stated estimate (6.4.5), we apply Corollary 6.2.1 with the parameters $L = \mathrm{asr}$ 
and $m_2(R) \le 2 \cdot \mathrm{asr}$. 
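
The two parameter estimates can be verified exactly on a small instance, since the expectations are finite sums. The NumPy sketch below normalizes two random factors to unit spectral norm (the test matrices are an assumption) and compares the exact values of the uniform bound and the per-sample second moment with asr and 2 · asr.

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((6, 40))
C = rng.standard_normal((40, 7))
B /= np.linalg.norm(B, 2)                 # unit spectral norm
C /= np.linalg.norm(C, 2)

col_sq = np.sum(B**2, axis=0)             # squared column norms of B
row_sq = np.sum(C**2, axis=1)             # squared row norms of C
total = col_sq.sum() + row_sq.sum()       # ||B||_F^2 + ||C||_F^2
p = (col_sq + row_sq) / total

# With unit spectral norms, the stable rank equals the squared Frobenius norm.
asr = 0.5 * total

# Exact uniform bound on ||R||.
L_exact = np.max(np.sqrt(col_sq * row_sq) / p)

# Exact second moments: sums of weighted outer products.
E_RRstar = (B * (row_sq / p)) @ B.T       # sum_j (||c_j:||^2 / p_j) b_:j b_:j^*
E_RstarR = (C.T * (col_sq / p)) @ C       # sum_j (||b_:j||^2 / p_j) c_j:^* c_j:
m2_exact = max(np.linalg.norm(E_RRstar, 2), np.linalg.norm(E_RstarR, 2))

print(f"asr = {asr:.3f}, L exact = {L_exact:.3f}, m2 exact = {m2_exact:.3f}")
```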


6.5 Application: Random Features 

As a final application of empirical matrix approximation, let us discuss a contemporary idea 
from machine learning called random features. Although this technique may appear more 
sophisticated than randomized sparsification or randomized matrix multiplication, it depends 
on exactly the same principles. Random feature maps were proposed by Ali Rahimi and Ben 
Recht [RR07]. The analysis in this section is due to David Lopez-Paz et al. [LPSS+14]. 

6.5.1 Kernel Matrices 

Let $\mathcal{X}$ be a set. We think about the elements of the set $\mathcal{X}$ as (potential) observations that we 
would like to use to perform learning and inference tasks. Let us introduce a bounded measure 
$\Phi$ of similarity between pairs of points in the set: 

$$\Phi : \mathcal{X} \times \mathcal{X} \to [-1, +1].$$

The similarity measure $\Phi$ is often called a kernel. We assume that the kernel returns the value +1 
when its arguments are identical, and it returns smaller values when its arguments are dissimilar. 
We also assume that the kernel is symmetric; that is, $\Phi(x, y) = \Phi(y, x)$ for all arguments $x, y \in \mathcal{X}$. 

A simple example of a kernel is the angular similarity between a pair of points in a Euclidean 
space: 

$$\Phi(x, y) = \frac{2}{\pi} \arcsin \frac{\langle x, y \rangle}{\|x\| \, \|y\|} = 1 - \frac{2 \angle(x, y)}{\pi} \quad \text{for } x, y \in \mathbb{R}^d. \tag{6.5.1}$$

We write $\angle(\cdot, \cdot)$ for the planar angle between two vectors, measured in radians. As usual, we 
instate the convention that 0/0 = 0. See Figure 6.1 for an illustration. 

Suppose that $x_1, \dots, x_N \in \mathcal{X}$ are observations. The kernel matrix $G = [g_{ij}] \in \mathbb{M}_N$ just tabulates 
the values of the kernel function for each pair of data points: 

$$g_{ij} = \Phi(x_i, x_j) \quad \text{for } i, j = 1, \dots, N.$$

It may be helpful to think about the kernel matrix G as a generalization of the Gram matrix of a 
family of points in a Euclidean space. We say that the kernel $\Phi$ is positive definite if the kernel 
matrix G is positive semidefinite for any choice of observations $\{x_i\} \subset \mathcal{X}$. We will be concerned 
only with positive-definite kernels in this discussion. 

In the Euclidean setting, there are statistical learning methods that only require the inner 
product between each pair of observations. These algorithms can be extended to the kernel 
setting by replacing each inner product with a kernel evaluation. As a consequence, kernel matrices 
can be used for classification, regression, and feature selection. In these applications, kernels 
are advantageous because they work outside the Euclidean domain, and they allow task-specific 
measures of similarity. This idea, sometimes called the kernel trick, is one of the major insights 
in modern machine learning. 

A significant challenge for algorithms based on kernels is that the kernel matrix is big. 
Indeed, G contains $O(N^2)$ entries, where N is the number of data points. Furthermore, the cost 
of constructing the kernel matrix is $O(d N^2)$ where d is the number of parameters required to 
specify a point in the universe $\mathcal{X}$. 

Nevertheless, there is an opportunity. Large data sets tend to be redundant, so the kernel 
matrix also tends to be redundant. This manifests in the kernel matrix being close to a low-rank 
matrix. As a consequence, we may try to replace the kernel matrix by a low-rank proxy. For some 
similarity measures, we can accomplish this task using empirical approximation. 

6.5.2 Random Features and Low-Rank Approximation of the Kernel Matrix 

In certain cases, a positive-definite kernel can be written as an expectation, and we can take 
advantage of this representation to construct an empirical approximation of the kernel matrix. Let 
us begin with the general construction, and then we will present a few examples in Section 6.5.3. 

Let $\mathcal{W}$ be a sample space equipped with a sigma-algebra and a probability measure $\mu$. 
Introduce a bounded feature map: 

$$\psi : \mathcal{X} \times \mathcal{W} \to [-b, +b] \quad \text{where } b \ge 0.$$

Consider a random variable w taking values in $\mathcal{W}$ and distributed according to the measure $\mu$. 
We assume that this random variable satisfies the reproducing property 

$$\Phi(x, y) = \mathbb{E}_w \left[ \psi(x; w) \cdot \psi(y; w) \right] \quad \text{for all } x, y \in \mathcal{X}. \tag{6.5.2}$$


The pair $(\psi, w)$ is called a random feature map for the kernel $\Phi$. 

We want to approximate the kernel matrix associated with a set $\{x_1, \dots, x_N\} \subset \mathcal{X}$ of observations. To do 
so, we draw a random vector $w \in \mathcal{W}$ distributed according to $\mu$. Form a random vector $z \in \mathbb{R}^N$ by 
applying the feature map to each data point with the same choice of the random vector w. That 
is, 

$$z = \begin{bmatrix} z_1 \\ \vdots \\ z_N \end{bmatrix} = \begin{bmatrix} \psi(x_1; w) \\ \vdots \\ \psi(x_N; w) \end{bmatrix}.$$

The vector z is sometimes called a random feature. By the reproducing property (6.5.2) for the 
random feature map, 

$$g_{ij} = \Phi(x_i, x_j) = \mathbb{E}_w \left[ \psi(x_i; w) \cdot \psi(x_j; w) \right] = \mathbb{E}_w \left[ z_i z_j \right] \quad \text{for } i, j = 1, 2, 3, \dots, N.$$

We can write this relation in matrix form as $G = \mathbb{E}(z z^*)$. Therefore, the random matrix $R = z z^*$ is 
an unbiased rank-one estimator for the kernel matrix G. This representation demonstrates that 
random feature maps, as defined here, only exist for positive-definite kernels. 

As usual, we construct a better empirical approximation of the kernel matrix G by averaging 
several realizations of the simple estimator R: 

$$R_n = \frac{1}{n} \sum_{k=1}^{n} R_k \quad \text{where each $R_k$ is an independent copy of $R$.} \tag{6.5.3}$$

In other words, we are using n independent random features $z_1, \dots, z_n$ to approximate the kernel 
matrix. The question is how many random features are needed before our estimator is accurate. 

6.5.3 Examples of Random Feature Maps 

Before we continue with the analysis, let us describe some random feature maps. This discussion 
is tangential to our theme of matrix concentration, but it is valuable to understand why random 
feature maps exist. 

First, let us consider the angular similarity (6.5.1) defined on $\mathbb{R}^d$. We can construct a random 
feature map using a classical result from plane geometry. If we draw w uniformly from the unit 
sphere $\mathbb{S}^{d-1} \subset \mathbb{R}^d$, then 

$$\Phi(x, y) = 1 - \frac{2 \angle(x, y)}{\pi} = \mathbb{E}_w \left[ \operatorname{sgn}\langle x, w \rangle \cdot \operatorname{sgn}\langle y, w \rangle \right] \quad \text{for all } x, y \in \mathbb{R}^d. \tag{6.5.4}$$

The easy proof of this relation should be visible from the diagram in Figure 6.1. In light of the 
formula (6.5.4), we set $\mathcal{W} = \mathbb{S}^{d-1}$ with the uniform measure, and we define the feature map 

$$\psi(x; w) = \operatorname{sgn}\langle x, w \rangle.$$

The reproducing property (6.5.2) follows immediately from (6.5.4). Therefore, the pair $(\psi, w)$ is a 
random feature map for the angular similarity kernel. 
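
To see this feature map at work, the NumPy sketch below computes the angular-similarity kernel matrix for a small synthetic data set and compares it with the average of n rank-one matrices built from sign features. The function names and the data are assumptions made for illustration. Random directions are generated from standard normal vectors, which is harmless because the sign of an inner product does not depend on the length of w.

```python
import numpy as np

def angular_kernel(X):
    """Kernel matrix of the angular similarity; rows of X are nonzero observations."""
    norms = np.linalg.norm(X, axis=1)
    cosines = (X @ X.T) / np.outer(norms, norms)
    G = 1.0 - 2.0 * np.arccos(np.clip(cosines, -1.0, 1.0)) / np.pi
    np.fill_diagonal(G, 1.0)              # remove rounding error on the diagonal
    return G

def angular_random_features(X, n, rng):
    """Return an N x n array whose columns are independent random features z."""
    d = X.shape[1]
    # A standard normal vector has a uniformly distributed direction.
    W = rng.standard_normal((d, n))
    return np.sign(X @ W)

rng = np.random.default_rng(7)
X = rng.standard_normal((25, 3))
G = angular_kernel(X)
Z = angular_random_features(X, 20_000, rng)
R_n = Z @ Z.T / Z.shape[1]                # average of the rank-one matrices z z^*
err = np.linalg.norm(R_n - G, 2)
print(f"||R_n - G|| = {err:.4f}, ||G|| = {np.linalg.norm(G, 2):.4f}")
```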

Next, let us describe an important class of kernels that can be expressed using random feature 
maps. A kernel on $\mathbb{R}^d$ is translation invariant if there is a function $\varphi : \mathbb{R}^d \to \mathbb{R}$ for which 

$$\Phi(x, y) = \varphi(x - y) \quad \text{for all } x, y \in \mathbb{R}^d.$$

Figure 6.1: The angular similarity between two vectors. Let x and y be nonzero vectors in 
$\mathbb{R}^2$ with angle $\angle(x, y)$. The red region contains the directions u where the product 
$\operatorname{sgn}\langle x, u \rangle \cdot \operatorname{sgn}\langle y, u \rangle$ equals +1, and the blue region contains the directions u where the 
same product equals −1. The blue region subtends a total angle of $2 \angle(x, y)$, and the red region 
subtends a total angle of $2\pi - 2 \angle(x, y)$. 


Bochner’s Theorem, a classical result from harmonic analysis, gives a representation for each 
continuous, positive-definite, translation-invariant kernel: 

$$\Phi(x, y) = \varphi(x - y) = c \int_{\mathbb{R}^d} e^{i \langle x, w \rangle} \cdot e^{-i \langle y, w \rangle} \, d\mu(w) \quad \text{for all } x, y \in \mathbb{R}^d. \tag{6.5.5}$$

In this expression, the positive scale factor c and the probability measure $\mu$ depend only on the 
function $\varphi$. The formula (6.5.5) yields a (complex-valued) random feature map: 

$$\psi_{\mathbb{C}}(x; w) = \sqrt{c} \, e^{i \langle x, w \rangle} \quad \text{where $w$ has distribution $\mu$ on $\mathbb{R}^d$.}$$

This map satisfies a complex variant of the reproducing property (6.5.2): 

$$\Phi(x, y) = \mathbb{E}_w \left[ \psi_{\mathbb{C}}(x; w) \cdot \psi_{\mathbb{C}}(y; w)^* \right] \quad \text{for all } x, y \in \mathbb{R}^d,$$

where we have written * for complex conjugation. 

With a little more work, we can construct a real-valued random feature map. Recall that the 
kernel $\Phi$ is symmetric, so the complex exponentials in (6.5.5) can be written in terms of cosines. 
This observation leads to the random feature map 

$$\psi(x; w, U) = \sqrt{2c} \cos\left( \langle x, w \rangle + U \right) \quad \text{where } w \sim \mu \text{ and } U \sim \text{uniform}[0, 2\pi]. \tag{6.5.6}$$

To verify that $(\psi, (w, U))$ reproduces the kernel $\Phi$, as required by (6.5.2), we just make a short 
calculation using the angle-sum formula for the cosine. 

We conclude this section with the most important example of a random feature map from 
the class we have just described. Consider the Gaussian radial basis function kernel: 

$$\Phi(x, y) = e^{-\alpha \|x - y\|^2 / 2} \quad \text{for all } x, y \in \mathbb{R}^d.$$

The positive parameter $\alpha$ reflects how close two points must be before they are regarded as 
“similar.” For the Gaussian kernel, Bochner’s Theorem (6.5.5) holds with the scaling factor c = 1 and 
the probability measure $\mu = \text{normal}(0, \alpha I_d)$. In summary, we define 

$$\psi(x; w, U) = \sqrt{2} \cos\left( \langle x, w \rangle + U \right) \quad \text{where } w \sim \text{normal}(0, \alpha I_d) \text{ and } U \sim \text{uniform}[0, 2\pi].$$

This random feature map reproduces the Gaussian radial basis function kernel. 
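
The cosine feature map for the Gaussian kernel can be tested the same way. The NumPy sketch below is an illustrative implementation with hypothetical function names and synthetic data; it draws the frequencies from normal$(0, \alpha I_d)$ and the phases from uniform$[0, 2\pi]$, and it compares the averaged rank-one estimator with the exact kernel matrix.

```python
import numpy as np

def gaussian_kernel(X, alpha):
    """Exact Gaussian radial basis function kernel matrix; rows of X are observations."""
    sq = np.sum(X**2, axis=1)
    dist_sq = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-alpha * dist_sq / 2.0)

def gaussian_random_features(X, alpha, n, rng):
    """Return an N x n array of cosine features sqrt(2) * cos(<x, w> + U)."""
    d = X.shape[1]
    W = np.sqrt(alpha) * rng.standard_normal((d, n))   # w ~ normal(0, alpha * I_d)
    U = rng.uniform(0.0, 2.0 * np.pi, size=n)          # one phase per feature
    return np.sqrt(2.0) * np.cos(X @ W + U)

rng = np.random.default_rng(8)
X = rng.standard_normal((30, 4))
alpha = 0.5
G = gaussian_kernel(X, alpha)
Z = gaussian_random_features(X, alpha, 20_000, rng)
R_n = Z @ Z.T / Z.shape[1]
err = np.linalg.norm(R_n - G, 2)
print(f"||R_n - G|| = {err:.4f}, ||G|| = {np.linalg.norm(G, 2):.4f}")
```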


6.5.4 Performance of the Random Feature Approximation 

We will demonstrate that the approximation $\bar{R}_n$ of the $N \times N$ kernel matrix $G$ using $n$ random
features, constructed in (6.5.3), leads to an estimate of the form

\[ \mathbb{E} \|\bar{R}_n - G\| \le \sqrt{\frac{2 b N \|G\| \log(2N)}{n}} + \frac{2 b N \log(2N)}{3n}. \tag{6.5.7} \]

In this expression, $b$ is the uniform bound on the magnitude of the feature map $\psi$. The short
proof of (6.5.7) appears in §6.5.5.

To clarify what this result means, we introduce the intrinsic dimension of the $N \times N$ kernel
matrix $G$:

\[ \operatorname{intdim}(G) = \operatorname{srank}(G^{1/2}) = \frac{\operatorname{tr} G}{\|G\|} = \frac{N}{\|G\|}. \]


The stable rank is defined in Section 2.1.15. We have used the assumption that the similarity 
measure is positive definite to justify the computation of the square root of the kernel matrix, 
and $\operatorname{tr} G = N$ because of the requirement that $\Phi(x, x) = +1$ for all $x$. See §7.1 for further
discussion of the intrinsic dimension.

Now, assume that the number $n$ of random features satisfies the bound

\[ n \ge 2 b \varepsilon^{-2} \cdot \operatorname{intdim}(G) \cdot \log(2N). \]

In view of (6.5.7), the relative error in the empirical approximation of the kernel matrix satisfies

\[ \frac{\mathbb{E} \|\bar{R}_n - G\|}{\|G\|} \le \varepsilon + \varepsilon^2. \]

We learn that the randomized approximation of the kernel matrix $G$ is accurate when its intrinsic
dimension is much smaller than the number of data points. That is, $\operatorname{intdim}(G) \ll N$.
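
To get a feeling for the sample-size condition, the following sketch evaluates its right-hand side for two extreme kernel matrices. It assumes NumPy; the helper names and the values $b = 2$ and $\varepsilon = 0.1$ are hypothetical choices made for illustration only.

```python
import numpy as np

def intrinsic_dimension(G):
    """intdim(G) = tr(G) / ||G|| for a nonzero positive-semidefinite matrix G."""
    return np.trace(G) / np.linalg.norm(G, 2)

def feature_count(G, b, eps):
    """Right-hand side of the condition n >= 2 b eps^{-2} intdim(G) log(2N)."""
    N = G.shape[0]
    return 2.0 * b * intrinsic_dimension(G) * np.log(2 * N) / eps**2

N = 100
G_redundant = np.ones((N, N))   # every pair of points is fully similar: intdim = 1
G_spread = np.eye(N)            # every pair of distinct points is dissimilar: intdim = N
n_redundant = feature_count(G_redundant, b=2.0, eps=0.1)
n_spread = feature_count(G_spread, b=2.0, eps=0.1)
print(n_redundant, n_spread)
```

In the first case the required number of features grows only like $\log(2N)$, while in the second case it grows like $N \log(2N)$, so random features offer no saving.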


6.5.5 Analysis of the Random Feature Approximation 

The analysis of random features is based on Corollary 6.2.1. To apply this result, we need the
per-sample second moment $m_2(R)$ and the uniform upper bound $L$. Both are easy to come by.
First, observe that

\[ \|R\| = \|zz^*\| = \|z\|^2 \le bN. \]

Recall that $b$ is the uniform bound on the feature map $\psi$, and $N$ is the number of components in
the random feature vector $z$.

Second, we calculate that

\[ \mathbb{E} R^2 = \mathbb{E}\big( \|z\|^2 \, zz^* \big) \preccurlyeq bN \cdot \mathbb{E}(zz^*) = bN \cdot G. \]

Each random matrix $zz^*$ is positive semidefinite, so we can introduce the upper bound $\|z\|^2 \le bN$.
The last identity holds because $R$ is an unbiased estimator of the kernel matrix $G$. It follows
that

\[ m_2(R) = \|\mathbb{E} R^2\| \le bN \cdot \|G\|. \]

This is our bound for the per-sample second moment.

Finally, we invoke Corollary 6.2.1 with parameters $L = bN$ and $m_2(R) \le bN \|G\|$ to arrive at
the estimate (6.5.7).

6.6 Proof of the Matrix Bernstein Inequality 

Now, let us turn to the proof of the matrix Bernstein inequality, Theorem 6.1.1. This result is a 
corollary of a matrix concentration inequality for a sum of bounded random Hermitian matrices. 
We begin with a statement and discussion of the Hermitian result, and then we explain how the 
general result follows. 

6.6.1 A Sum of Bounded Random Hermitian Matrices 

The first result is a Bernstein inequality for a sum of independent, random Hermitian matrices 
whose eigenvalues are bounded above. 

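Theorem 6.6.1 (Matrix Bernstein: Hermitian Case). Consider a finite sequence $\{X_k\}$ of independent,
random, Hermitian matrices with dimension $d$. Assume that

\[ \mathbb{E} X_k = 0 \quad \text{and} \quad \lambda_{\max}(X_k) \le L \quad \text{for each index } k. \]

Introduce the random matrix

\[ Y = \sum\nolimits_k X_k. \]

Let $v(Y)$ be the matrix variance statistic of the sum:

\[ v(Y) = \|\mathbb{E} Y^2\| = \Big\| \sum\nolimits_k \mathbb{E} X_k^2 \Big\|. \tag{6.6.1} \]

Then

\[ \mathbb{E} \lambda_{\max}(Y) \le \sqrt{2 v(Y) \log d} + \frac{1}{3} L \log d. \tag{6.6.2} \]

Furthermore, for all $t \ge 0$,

\[ \mathbb{P}\{ \lambda_{\max}(Y) \ge t \} \le d \cdot \exp\left( \frac{-t^2/2}{v(Y) + Lt/3} \right). \tag{6.6.3} \]

The proof of Theorem 6.6.1 appears below in §6.6.4.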

6.6.2 Discussion

Theorem 6.6.1 also yields information about the minimum eigenvalue of an independent sum 
of d-dimensional Hermitian matrices. Suppose that the independent random matrices satisfy 

\[ \mathbb{E} X_k = 0 \quad \text{and} \quad \lambda_{\min}(X_k) \ge -\underline{L} \quad \text{for each index } k. \]

Applying the expectation bound (6.6.2) to $-Y$, we obtain

\[ \mathbb{E} \lambda_{\min}(Y) \ge -\sqrt{2 v(Y) \log d} - \frac{1}{3} \underline{L} \log d. \tag{6.6.4} \]

We can use (6.6.3) to develop a tail bound. For $t \ge 0$,

\[ \mathbb{P}\{ \lambda_{\min}(Y) \le -t \} \le d \cdot \exp\left( \frac{-t^2/2}{v(Y) + \underline{L} t/3} \right). \]

Let us emphasize that the bounds for $\lambda_{\max}(Y)$ and $\lambda_{\min}(Y)$ may diverge because the two
parameters $L$ and $\underline{L}$ can take sharply different values. This fact indicates that the maximum
eigenvalue bound in Theorem 6.6.1 is a less strict assumption than the spectral norm bound in
Theorem 6.1.1.

6.6.3 Bounds for the Matrix Mgf and Cgf 

In establishing the matrix Bernstein inequality, the main challenge is to obtain an appropriate 
bound for the matrix mgf and cgf of a zero-mean random matrix whose norm satisfies a uniform 
bound. We do not present the sharpest estimate possible, but rather the one that leads most 
directly to the useful results stated in Theorem 6.6.1. 

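Lemma 6.6.2 (Matrix Bernstein: Mgf and Cgf Bound). Suppose that $X$ is a random Hermitian
matrix that satisfies

\[ \mathbb{E} X = 0 \quad \text{and} \quad \lambda_{\max}(X) \le L. \]

Then, for $0 < \theta < 3/L$,

\[ \mathbb{E} e^{\theta X} \preccurlyeq \exp\left( \frac{\theta^2/2}{1 - \theta L/3} \cdot \mathbb{E} X^2 \right) \quad \text{and} \quad \log \mathbb{E} e^{\theta X} \preccurlyeq \frac{\theta^2/2}{1 - \theta L/3} \cdot \mathbb{E} X^2. \]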

Proof. Fix the parameter $\theta > 0$. In the exponential $e^{\theta X}$, we would like to expose the random
matrix $X$ and its square $X^2$ so that we can exploit information about the mean and variance. To
that end, we write

\[ e^{\theta X} = I + \theta X + (e^{\theta X} - \theta X - I) = I + \theta X + X \cdot f(X) \cdot X, \tag{6.6.5} \]

where $f$ is a function on the real line:

\[ f(x) = \frac{e^{\theta x} - \theta x - 1}{x^2} \quad \text{for } x \ne 0 \quad \text{and} \quad f(0) = \frac{\theta^2}{2}. \]

The function $f$ is increasing because its derivative is positive. Therefore, $f(x) \le f(L)$ when $x \le L$.
By assumption, the eigenvalues of $X$ do not exceed $L$, so the Transfer Rule (2.1.14) implies that

\[ f(X) \preccurlyeq f(L) \cdot I. \tag{6.6.6} \]

The Conjugation Rule (2.1.12) allows us to introduce the relation (6.6.6) into our expansion (6.6.5)
of the matrix exponential:

\[ e^{\theta X} \preccurlyeq I + \theta X + X \big( f(L) \cdot I \big) X = I + \theta X + f(L) \cdot X^2. \]

This relation is the basis for our matrix mgf bound.

To obtain the desired result, we develop a further estimate for $f(L)$. This argument involves
a clever application of Taylor series:

\[ f(L) = \frac{e^{\theta L} - \theta L - 1}{L^2} = \frac{1}{L^2} \sum_{q=2}^{\infty} \frac{(\theta L)^q}{q!} \le \frac{\theta^2}{2} \sum_{q=2}^{\infty} \frac{(\theta L)^{q-2}}{3^{q-2}} = \frac{\theta^2/2}{1 - \theta L/3}. \]

The second expression is simply the Taylor expansion of the fraction, viewed as a function of
$\theta$. We obtain the inequality by factoring out $(\theta L)^2/2$ from each term in the series and invoking
the bound $q! \ge 2 \cdot 3^{q-2}$, valid for each $q = 2, 3, 4, \dots$. Sum the geometric series to obtain the final
identity.

To complete the proof of the mgf bound, we combine the last two displays:

\[ e^{\theta X} \preccurlyeq I + \theta X + \frac{\theta^2/2}{1 - \theta L/3} \cdot X^2. \]

This estimate is valid because $X^2$ is positive semidefinite. Expectation preserves the semidefinite
order, so

\[ \mathbb{E} e^{\theta X} \preccurlyeq I + \frac{\theta^2/2}{1 - \theta L/3} \cdot \mathbb{E} X^2 \preccurlyeq \exp\left( \frac{\theta^2/2}{1 - \theta L/3} \cdot \mathbb{E} X^2 \right). \]

We have used the assumption that $X$ has zero mean. The second semidefinite relation follows
when we apply the Transfer Rule (2.1.14) to the inequality $1 + a \le e^a$, which holds for $a \in \mathbb{R}$.

To obtain the semidefinite bound for the cgf, we extract the logarithm of the mgf bound using 
the fact (2.1.18) that the logarithm is operator monotone. □ 
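
The two scalar facts that drive the proof of Lemma 6.6.2 are easy to test numerically. The sketch below uses only the Python standard library; the function names and the value $L = 2$ are illustrative assumptions. It checks the factorial bound $q! \ge 2 \cdot 3^{q-2}$ for a range of $q$ and the resulting estimate $f(L) \le (\theta^2/2)/(1 - \theta L/3)$ on a grid of $\theta$ inside $(0, 3/L)$. A finite check of this kind is a sanity test, not a substitute for the argument above.

```python
import math

def f(x, theta):
    """f(x) = (e^{theta x} - theta x - 1) / x^2, with f(0) = theta^2 / 2."""
    if x == 0.0:
        return theta**2 / 2.0
    # expm1 avoids cancellation when theta * x is small.
    return (math.expm1(theta * x) - theta * x) / x**2

def g(theta, L):
    """g(theta) = (theta^2 / 2) / (1 - theta L / 3), defined for 0 < theta < 3 / L."""
    return (theta**2 / 2.0) / (1.0 - theta * L / 3.0)

# The factorial bound used to compare the Taylor series with a geometric series.
factorial_ok = all(math.factorial(q) >= 2 * 3 ** (q - 2) for q in range(2, 40))

L = 2.0                                                  # hypothetical eigenvalue bound
thetas = [0.01 * k * 3.0 / L for k in range(1, 100)]     # grid inside (0, 3/L)
bound_ok = all(f(L, th) <= g(th, L) for th in thetas)
print(factorial_ok, bound_ok)
```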


6.6.4 Proof of the Hermitian Case 

We are prepared to establish the matrix Bernstein inequalities for random Hermitian matrices. 

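Proof of Theorem 6.6.1. Consider a finite sequence $\{X_k\}$ of random Hermitian matrices with
dimension $d$. Assume that

\[ \mathbb{E} X_k = 0 \quad \text{and} \quad \lambda_{\max}(X_k) \le L \quad \text{for each index } k. \]

The matrix Bernstein cgf bound, Lemma 6.6.2, provides that

\[ \log \mathbb{E} e^{\theta X_k} \preccurlyeq g(\theta) \cdot \mathbb{E} X_k^2 \quad \text{where} \quad g(\theta) = \frac{\theta^2/2}{1 - \theta L/3} \quad \text{for } 0 < \theta < 3/L. \tag{6.6.7} \]

Introduce the sum $Y = \sum_k X_k$.

We begin with the bound (6.6.2) for the expectation $\mathbb{E} \lambda_{\max}(Y)$. Invoke the master inequality,
relation (3.6.1) in Theorem 3.6.1, to find that

\begin{align*}
\mathbb{E} \lambda_{\max}(Y) &\le \inf_{\theta > 0} \frac{1}{\theta} \log \operatorname{tr} \exp\Big( \sum\nolimits_k \log \mathbb{E} e^{\theta X_k} \Big) \\
&\le \inf_{0 < \theta < 3/L} \frac{1}{\theta} \log \operatorname{tr} \exp\Big( g(\theta) \sum\nolimits_k \mathbb{E} X_k^2 \Big) \\
&= \inf_{0 < \theta < 3/L} \frac{1}{\theta} \log \operatorname{tr} \exp\big( g(\theta) \, \mathbb{E} Y^2 \big).
\end{align*}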

As usual, to move from the first to the second line, we invoke the fact (2.1.16) that the trace
exponential is monotone to introduce the semidefinite bound (6.6.7) for the cgf. Then we use
the additivity rule (2.2.5) for the variance of an independent sum to identify $\mathbb{E} Y^2$. The rest of the
argument glides along a well-oiled track:

\begin{align*}
\mathbb{E} \lambda_{\max}(Y) &\le \inf_{0 < \theta < 3/L} \frac{1}{\theta} \log \big[ d \, \lambda_{\max}\big( \exp( g(\theta) \cdot \mathbb{E} Y^2 ) \big) \big] \\
&= \inf_{0 < \theta < 3/L} \frac{1}{\theta} \log \big[ d \exp\big( g(\theta) \cdot \lambda_{\max}(\mathbb{E} Y^2) \big) \big] \\
&\le \inf_{0 < \theta < 3/L} \frac{1}{\theta} \log \big[ d \exp\big( g(\theta) \cdot v(Y) \big) \big] \\
&= \inf_{0 < \theta < 3/L} \left[ \frac{\log d}{\theta} + \frac{\theta/2}{1 - \theta L/3} \cdot v(Y) \right].
\end{align*}

In the first inequality, we bound the trace of the exponential by the dimension $d$ times the maximum
eigenvalue. The next line follows from the Spectral Mapping Theorem, Proposition 2.1.3.
In the third line, we identify the matrix variance statistic $v(Y)$ from (6.6.1). Afterward, we extract
the logarithm and simplify. Finally, we compute the infimum to complete the proof of (6.6.2).
For reference, the optimal argument is

\[ \theta = \frac{6 L \log d + 9 \sqrt{2 v(Y) \log d}}{2 L^2 \log d + 9 v(Y) + 6 L \sqrt{2 v(Y) \log d}}. \]

We recommend using a computer algebra system to confirm this point. 
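
A numerical check is also straightforward. The following sketch uses only the Python standard library, with hypothetical values $v(Y) = 4$, $L = 1.5$, and $d = 50$ chosen here for illustration. It evaluates the objective $\theta \mapsto \log d / \theta + (\theta/2) \, v(Y) / (1 - \theta L/3)$ at the stated optimal argument, compares the result with the closed form $\sqrt{2 v(Y) \log d} + \tfrac{1}{3} L \log d$ from (6.6.2), and confirms on a fine grid that no other $\theta$ in $(0, 3/L)$ does better.

```python
import math

def objective(theta, v, L, d):
    """log(d) / theta + (theta / 2) / (1 - theta L / 3) * v, for 0 < theta < 3 / L."""
    return math.log(d) / theta + (theta / 2.0) / (1.0 - theta * L / 3.0) * v

def optimal_theta(v, L, d):
    """The optimal argument stated in the text."""
    a = math.log(d)
    s = math.sqrt(2.0 * v * a)
    return (6.0 * L * a + 9.0 * s) / (2.0 * L**2 * a + 9.0 * v + 6.0 * L * s)

v, L, d = 4.0, 1.5, 50                       # hypothetical parameter values
theta_star = optimal_theta(v, L, d)
closed_form = math.sqrt(2.0 * v * math.log(d)) + L * math.log(d) / 3.0

# Brute-force minimum over a fine grid in the open interval (0, 3/L).
grid = [k * (3.0 / L) / 10000 for k in range(1, 10000)]
grid_min = min(objective(th, v, L, d) for th in grid)
print(objective(theta_star, v, L, d), closed_form, grid_min)
```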

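Next, we develop the tail bound (6.6.3) for $\lambda_{\max}(Y)$. Owing to the master tail inequality (3.6.3),
we have

\begin{align*}
\mathbb{P}\{ \lambda_{\max}(Y) \ge t \} &\le \inf_{\theta > 0} e^{-\theta t} \operatorname{tr} \exp\Big( \sum\nolimits_k \log \mathbb{E} e^{\theta X_k} \Big) \\
&\le \inf_{0 < \theta < 3/L} e^{-\theta t} \operatorname{tr} \exp\big( g(\theta) \, \mathbb{E} Y^2 \big) \\
&\le \inf_{0 < \theta < 3/L} d \, e^{-\theta t} \exp\big( g(\theta) \cdot v(Y) \big).
\end{align*}

The justifications are the same as before. The exact value of the infimum is messy, so we proceed
with the inspired choice $\theta = t / (v(Y) + Lt/3)$, which results in the elegant bound (6.6.3). □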
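
The effect of this choice of $\theta$ can be confirmed with a few lines of arithmetic. The sketch below, which uses only the Python standard library and hypothetical parameter values, substitutes $\theta = t/(v(Y) + Lt/3)$ into $d \, e^{-\theta t} \exp(g(\theta) \, v(Y))$ and compares the result with the right-hand side of (6.6.3). Note that this $\theta$ always lies in the admissible range, since $t/(v + Lt/3) < 3/L$ whenever $v > 0$.

```python
import math

def g(theta, L):
    """g(theta) = (theta^2 / 2) / (1 - theta L / 3), for 0 < theta < 3 / L."""
    return (theta**2 / 2.0) / (1.0 - theta * L / 3.0)

def bernstein_tail(t, v, L, d):
    """Right-hand side of the tail bound (6.6.3)."""
    return d * math.exp(-(t**2 / 2.0) / (v + L * t / 3.0))

def tail_from_theta(t, v, L, d):
    """Value of d * exp(-theta t + g(theta) v) at theta = t / (v + L t / 3)."""
    theta = t / (v + L * t / 3.0)
    return d * math.exp(-theta * t + g(theta, L) * v)

v, L, d = 4.0, 1.5, 50                       # hypothetical parameter values
for t in (0.1, 1.0, 5.0, 20.0):
    print(t, bernstein_tail(t, v, L, d), tail_from_theta(t, v, L, d))
```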


6.6.5 Proof of the General Case 

Finally, we explain how to derive Theorem 6.1.1, for general matrices, from Theorem 6.6.1. This 
result follows immediately when we apply the matrix Bernstein bounds for Hermitian matrices
to the Hermitian dilation of a sum of general matrices. 

Proof of Theorem 6.1.1. Consider a finite sequence $\{S_k\}$ of $d_1 \times d_2$ random matrices, and assume
that

\[ \mathbb{E} S_k = 0 \quad \text{and} \quad \|S_k\| \le L \quad \text{for each index } k. \]

We define the two random matrices

\[ Z = \sum\nolimits_k S_k \quad \text{and} \quad Y = \mathscr{H}(Z) = \sum\nolimits_k \mathscr{H}(S_k), \]

where $\mathscr{H}$ is the Hermitian dilation (2.1.26). The second expression for $Y$ follows from the
property that the dilation is a real-linear map.

We will apply Theorem 6.6.1 to analyze $\|Z\|$. First, recall the fact (2.1.28) that

\[ \|Z\| = \lambda_{\max}(\mathscr{H}(Z)) = \lambda_{\max}(Y). \]

Next, we express the variance (6.6.1) of the random Hermitian matrix $Y$ in terms of the general
matrix $Z$. Indeed, the calculation (2.2.10) of the variance statistic of a dilation shows that

\[ v(Y) = v(\mathscr{H}(Z)) = v(Z). \]

Recall that the matrix variance statistic $v(Z)$ defined in (6.1.1) coincides with the general definition
from (2.2.8). The summands $\mathscr{H}(S_k)$ are zero-mean Hermitian matrices of dimension $d_1 + d_2$, and
(2.1.28) gives $\lambda_{\max}(\mathscr{H}(S_k)) = \|S_k\| \le L$, so the hypotheses of Theorem 6.6.1 are in force.
Finally, we invoke Theorem 6.6.1 to establish Theorem 6.1.1. □
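
The identity $\|Z\| = \lambda_{\max}(\mathscr{H}(Z))$ that underlies this reduction can be checked on a random example. The sketch below assumes NumPy and uses the standard block form of the dilation, with zero diagonal blocks and off-diagonal blocks $Z$ and $Z^*$; the function name, the seed, and the matrix size are illustrative choices.

```python
import numpy as np

def hermitian_dilation(Z):
    """Return the (d1 + d2) x (d1 + d2) block matrix [[0, Z], [Z^*, 0]]."""
    d1, d2 = Z.shape
    top = np.hstack([np.zeros((d1, d1), dtype=Z.dtype), Z])
    bottom = np.hstack([Z.conj().T, np.zeros((d2, d2), dtype=Z.dtype)])
    return np.vstack([top, bottom])

rng = np.random.default_rng(1)
Z = rng.normal(size=(3, 5)) + 1j * rng.normal(size=(3, 5))   # hypothetical 3 x 5 complex matrix
H = hermitian_dilation(Z)

spectral_norm = np.linalg.norm(Z, 2)        # largest singular value of Z
lam_max = np.linalg.eigvalsh(H)[-1]         # eigvalsh returns eigenvalues in ascending order
print(spectral_norm, lam_max)
```

The squared dilation is block diagonal with blocks $ZZ^*$ and $Z^*Z$, which is why the variance statistic of the dilation reduces to that of $Z$.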

6.7 Notes 

The literature contains a wide variety of Bernstein-type inequalities in the scalar case, and the 
matrix case is no different. The applications of the matrix Bernstein inequality are also numerous.
We only give a brief summary here.

6.7.1 Matrix Bernstein Inequalities 

David Gross [Gro11] and Ben Recht [Rec11] used the approach of Ahlswede & Winter [AW02] to
develop two different versions of the matrix Bernstein inequality. These papers helped to popularize
the use of matrix concentration inequalities in mathematical signal processing and statistics.
Nevertheless, their results involve a suboptimal variance parameter of the form

\[ v_{\mathrm{AW}}(Y) = \sum\nolimits_k \|\mathbb{E} X_k^2\|. \]

This parameter can be significantly larger than the matrix variance statistic (6.6.1) that appears 
in Theorem 6.6.1. They do coincide in some special cases, such as when the summands are 
independent and identically distributed. 
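
A small computation makes the gap concrete. The sketch below assumes NumPy; the function name and both examples are constructed here for illustration. In the first example, the summands are randomly signed diagonal units, so each second moment $\mathbb{E} X_k^2$ is a different coordinate projector; in the second, all summands share the same second moment.

```python
import numpy as np

def variance_parameters(second_moments):
    """Return (v_AW, v) given a list of the matrices E X_k^2."""
    v_aw = sum(np.linalg.norm(S, 2) for S in second_moments)      # sum of the norms
    v = np.linalg.norm(np.sum(second_moments, axis=0), 2)         # norm of the sum
    return v_aw, v

d = 6
basis = np.eye(d)
# X_k = (random sign) * e_k e_k^T, so E X_k^2 = e_k e_k^T for k = 1, ..., d.
spread = [np.outer(basis[k], basis[k]) for k in range(d)]
v_aw_spread, v_spread = variance_parameters(spread)

# Identically distributed summands share a single second moment A.
A = np.diag([3.0, 1.0, 0.5, 0.0, 0.0, 0.0])
v_aw_iid, v_iid = variance_parameters([A] * 4)
print(v_aw_spread, v_spread, v_aw_iid, v_iid)
```

In the first example the two parameters differ by a factor of $d$, while in the second they agree.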

Oliveira [Oli10a] established the first version of the matrix Bernstein inequality that yields the
correct matrix variance statistic (6.6.1). He accomplished this task with an elegant application 
of the Golden-Thompson inequality (3.3.3). His method even gives a result, called the matrix 
Freedman inequality, that holds for matrix-valued martingales. His bound is roughly equivalent 
with Theorem 6.6.1, up to the precise value of the constants. 

The matrix Bernstein inequality we have stated here, Theorem 6.6.1, first appeared in the 
paper [Tro11c, §6] by the author of these notes. The bounds for the expectation are new. The
argument is based on Lieb’s Theorem, and it also delivers a matrix Bennett inequality. This paper 
also describes how to establish matrix Bernstein inequalities for sums of unbounded random 
matrices, given some control over the matrix moments. 

The research in [Tro11c] is independent from Oliveira’s work [Oli10a], although Oliveira’s paper
motivated the subsequent article [Tro11a] and the technical report [Tro11b], which explain
how to use Lieb’s Theorem to study matrix martingales. The technical report [GT14] develops a
Bernstein inequality for interior eigenvalues using the Lieb-Seiringer Theorem [LS05].

For more versions of the matrix Bernstein inequality, see Vladimir Koltchinskii’s lecture notes 
from Saint-Flour [Kol11]. In Chapter 7, we present another extension of the matrix Bernstein
inequality that involves a smaller dimensional parameter. 

6.7.2 The Matrix Rosenthal-Pinelis Inequality

The matrix Rosenthal-Pinelis inequality (6.1.6) is a close cousin of the matrix Rosenthal inequality
(5.1.9). Both results are derived from the noncommutative Khintchine inequality (4.7.1) using
the same pattern of argument [CGT12a, Thm. A.1]. We believe that [CGT12a] is the first
paper to recognize and state the result (6.1.6), even though it is similar in spirit with the work
in [Rud99]. A self-contained, elementary proof of a related matrix Rosenthal-Pinelis inequality
appears in [MJC+14, Cor. 7.4].

Versions of the matrix Rosenthal-Pinelis inequality first appeared in the literature [JX03] on 
noncommutative martingales, where they were called noncommutative Burkholder inequalities. 
For an application to random matrices, see the follow-up work [JX08] by the same authors.
Subsequent papers [JZ12, JZ13] contain related noncommutative martingale inequalities inspired by
the research in [Oli10b, Tro11c].

6.7.3 Empirical Approximation 

Matrix approximation by random sampling is a special case of a general method that Bernard 
Maurey developed to compute entropy numbers of convex hulls. Let us give a short presentation 
of the original context, along with references to some other applications. 


Empirical Bounds for Covering Numbers 

Suppose that $X$ is a Banach space. Consider the convex hull $E = \operatorname{conv}\{e_1, \dots, e_N\}$ of a set of $N$
points in $X$, and assume that $\|e_k\| \le L$. We would like to give an upper bound for the number of
balls of radius $\varepsilon$ it takes to cover this set.

Fix a point $u \in E$, and express $u$ as a convex combination:

\[ u = \sum_{i=1}^{N} p_i e_i \quad \text{where} \quad \sum_{i=1}^{N} p_i = 1 \quad \text{and} \quad p_i \ge 0. \]

Let $x$ be the random vector in $X$ that takes value $e_k$ with probability $p_k$. We can approximate the
point $u$ as an average $\bar{x}_n = n^{-1} \sum_{k=1}^{n} x_k$ of independent copies $x_1, \dots, x_n$ of the random vector $x$.
Then

\[ \mathbb{E} \|\bar{x}_n - u\|_X = \frac{1}{n} \mathbb{E} \Big\| \sum_{k=1}^{n} (x_k - \mathbb{E} x) \Big\|_X \le \frac{2}{n} \mathbb{E} \Big\| \sum_{k=1}^{n} \varrho_k x_k \Big\|_X \le \frac{2}{n} \Big( \mathbb{E} \Big\| \sum_{k=1}^{n} \varrho_k x_k \Big\|_X^2 \Big)^{1/2}. \]

The family $\{\varrho_k\}$ consists of independent Rademacher random variables. The first inequality
depends on the symmetrization procedure [LT91, Lem. 6.3], and the second is Hölder’s. In certain
Banach spaces, a Khintchine-type inequality holds:

\[ \mathbb{E} \|\bar{x}_n - u\|_X \le \frac{2 T_2(X)}{n} \Big( \sum_{k=1}^{n} \mathbb{E} \|x_k\|_X^2 \Big)^{1/2} \le \frac{2 T_2(X) L}{\sqrt{n}}. \]

The last inequality depends on the uniform bound $\|e_k\| \le L$. This estimate controls the expected
error in approximating an arbitrary point in $E$ by randomized sampling.

The number $T_2(X)$ is called the type two constant of the Banach space $X$, and it can be
estimated in many concrete instances; see [LT91, Chap. 9] or [Pis89, Chap. 11]. For our purposes,
the most relevant example is the Banach space $\mathbb{M}^{d_1 \times d_2}$ consisting of $d_1 \times d_2$ matrices equipped
with the spectral norm. Its type two constant satisfies

\[ T_2(\mathbb{M}^{d_1 \times d_2}) \le \mathrm{Const} \cdot \sqrt{\log(d_1 + d_2)}. \]

This result follows from work of Tomczak-Jaegermann [TJ74, Thm. 3.1(ii)]. In fact, the space
$\mathbb{M}^{d_1 \times d_2}$ enjoys an even stronger property with respect to averages, namely the noncommutative
Khintchine inequality (4.7.1).

Now, suppose that the number $n$ of samples in our empirical approximation $\bar{x}_n$ of the point
$u \in E$ satisfies

\[ n \ge \left( \frac{2 T_2(X) L}{\varepsilon} \right)^2. \]

Then the probabilistic method ensures that there is some collection $u_1, \dots, u_n$ of points
drawn with repetition from the set $\{e_1, \dots, e_N\}$ that satisfies

\[ \Big\| \Big( n^{-1} \sum_{k=1}^{n} u_k \Big) - u \Big\|_X \le \varepsilon. \]

There are at most $N^n$ different ways to select the points $u_k$. It follows that we can cover the
convex hull $E = \operatorname{conv}\{e_1, \dots, e_N\}$ in $X$ with at most $N^n$ norm balls of radius $\varepsilon$.

History and Applications of Empirical Approximation 

Maurey did not publish his ideas, and the method was first broadcast in a paper of Pisier [Pis81, 
Lem. 1]. Another early reference is the work of Carl [Car85, Lem. 1]. More recently, this covering 
argument has been used to study the restricted isomorphism behavior of a random set of rows 
drawn from a discrete Fourier transform matrix [RV06].

By now, empirical approximation has appeared in a wide range of applied contexts, although 
many papers do not recognize the provenance of the method. Let us mention some examples 
in machine learning. Empirical approximation has been used to study what functions can be 
approximated by neural networks [Bar93, LBW96]. The same idea appears in papers on sparse 
modeling, such as [SSS08], and it supports the method of random features [RR07]. Empirical 
approximation also stands at the core of a recent algorithm for constructing approximate Nash 
equilibria [Bar14].

It is difficult to identify the earliest work in computational mathematics that invoked the 
empirical method to approximate matrices. The paper of Achlioptas & McSherry [AM01] on
randomized sparsification is one possible candidate.

Corollary 6.2.1, which we use to perform the analysis of matrix approximation by sampling, 
does not require the full power of the matrix Bernstein inequality, Theorem 6.1.1. Indeed,
Corollary 6.2.1 can be derived from the weaker methods of Ahlswede & Winter [AW02]; for example,
see the papers [Gro11, Rec11].

6.7.4 Randomized Sparsification 

The idea of using randomized sparsification to accelerate spectral computations appears in a 
paper of Achlioptas & McSherry [AM01, AM07]. d’Aspremont [d’A11] proposed to use sparsification
to accelerate algorithms for semidefinite programming. The paper [AKL13] by Achlioptas,
Karnin, & Liberty recommends sparsification as a mechanism for data compression. 

After the initial paper [AM01], several other researchers developed sampling schemes for
randomized sparsification [AHK06, GT09]. Later, Drineas & Zouzias [DZ11] pointed out that matrix
concentration inequalities can be used to analyze this type of algorithm. The paper [AKL13]
refined this analysis to obtain sharper bounds. The simple analysis here is drawn from a recent
note by Kundu & Drineas [KD14]. 

6.7.5 Randomized Matrix Multiplication 

The idea of using random sampling to accelerate matrix multiplication appeared in nascent form 
in a paper of Frieze, Kannan, & Vempala [FKV98]. The paper [DK01] of Drineas & Kannan
develops this idea in full generality, and the article [DKM06] of Drineas, Kannan, & Mahoney contains
a more detailed treatment. Subsequently, Tamas Sarlos obtained a significant improvement in
the performance of this algorithm [Sar06]. Rudelson & Vershynin [RV07] obtained the first error
bound for approximate matrix multiplication with respect to the spectral norm. The analysis
that we presented is adapted from the dissertation [Zou13] of Tassos Zouzias, which refines an
earlier treatment by Magen & Zouzias [MZ11]. See the monographs of Mahoney [Mah11] and
Woodruff [Woo14] for a more extensive discussion.

6.7.6 Random Features 

Our discussion of kernel methods is adapted from the book [SS98]. The papers [RR07, RR08] of 
Ali Rahimi and Ben Recht proposed the idea of using random features to summarize data for 
large-scale kernel machines. The construction (6.5.6) of a random feature map for a
translation-invariant, positive-definite kernel appears in their work. This approach has received a significant
amount of attention over the last few years, and there has been a lot of subsequent development.
For example, the paper [KK12] of Kar & Karnick shows how to construct random features for
inner-product kernels, and the paper [HXGD14] of Hamid et al. develops random features for
polynomial kernels. Our analysis of random features using the matrix Bernstein inequality is
drawn from the recent article [LPSS+14] of Lopez-Paz et al. The presentation here is adapted
from the author’s tutorial on randomized matrix approximation, given at ICML 2014 in Beijing.
We recommend the two papers [HXGD14, LPSS+14] for an up-to-date bibliography.



CHAPTER 7

Results Involving the Intrinsic Dimension


A minor shortcoming of our matrix concentration results is the dependence on the ambient 
dimension of the matrix. In this chapter, we show how to obtain a dependence on an intrinsic
dimension parameter, which occasionally is much smaller than the ambient dimension. In
many cases, intrinsic dimension bounds offer only a modest improvement. Nevertheless, there 
are examples where the benefits are significant enough that we can obtain nontrivial results for 
infinite-dimensional random matrices. 

In this chapter, we present a version of the matrix Chernoff inequality that involves an intrinsic
dimension parameter. We also describe a version of the matrix Bernstein inequality that
involves an intrinsic dimension parameter. The intrinsic Bernstein result usually improves on
Theorem 6.1.1. These results depend on a new argument that distills ideas from a paper [Min11]
of Stanislav Minsker. We omit intrinsic dimension bounds for matrix series, which the reader 
may wish to develop as an exercise. 

To give a sense of what these new results accomplish, we revisit some of the examples from 
earlier chapters. We apply the intrinsic Chernoff bound to study a random column submatrix of 
a fixed matrix. We also reconsider the randomized matrix multiplication algorithm in light of the 
intrinsic Bernstein bound. In each case, the intrinsic dimension parameters have an attractive 
interpretation in terms of the problem data. 


Overview 

We begin our development in §7.1 with the definition of the intrinsic dimension of a matrix. In 
§7.2, we present the intrinsic Chernoff bound and some of its consequences. In §7.3, we
describe the intrinsic Bernstein inequality and its applications. Afterward, we describe the new
ingredients that are required in the proofs. Section 7.4 explains how to extend the matrix Laplace
transform method beyond the exponential function, and §7.5 describes a simple but powerful
lemma that allows us to obtain the dependence on the intrinsic dimension. Section 7.6 contains
the proof of the intrinsic Chernoff bound, and §7.7 develops the proof of the intrinsic Bernstein
bound. 

7.1 The Intrinsic Dimension of a Matrix 

Some types of random matrices are concentrated in a small number of dimensions, while they 
have little content in other dimensions. So far, our bounds do not account for the difference. We 
need to introduce a more refined notion of dimension that will help us to discriminate among 
these examples. 

Definition 7.1.1 (Intrinsic Dimension). For a positive-semidefinite matrix $A$, the intrinsic
dimension is the quantity

\[ \operatorname{intdim}(A) = \frac{\operatorname{tr} A}{\|A\|}. \]

We interpret the intrinsic dimension as a measure of the number of dimensions where A has 
significant spectral content. 

Let us make a few observations that support this view. By expressing the trace and the norm 
in terms of the eigenvalues, we can verify that 

\[ 1 \le \operatorname{intdim}(A) \le \operatorname{rank}(A) \le \dim(A). \]

The first inequality is attained precisely when $A$ has rank one, while the second inequality is
attained precisely when all the nonzero eigenvalues of $A$ are equal, that is, when $A$ is a positive
multiple of an orthogonal projector (for instance, a multiple of the identity). The intrinsic dimension
is 0-homogeneous, so it is insensitive to changes in the scale of the matrix $A$. The intrinsic
dimension is not monotone with respect to the semidefinite order. Indeed, we can drive the intrinsic dimension to one
by increasing one eigenvalue of A substantially. 
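
These observations are easy to reproduce numerically. The sketch below assumes NumPy, and the example matrices are chosen here purely for illustration. It evaluates the intrinsic dimension of a rank-one matrix, an orthogonal projector, the identity, and a matrix with decaying eigenvalues, and then exhibits the failure of monotonicity by inflating one eigenvalue of the identity.

```python
import numpy as np

def intdim(A):
    """Intrinsic dimension tr(A) / ||A|| of a nonzero positive-semidefinite matrix."""
    return np.trace(A) / np.linalg.norm(A, 2)

u = np.array([1.0, 2.0, 2.0])
rank_one = np.outer(u, u)                 # intdim = 1
projector = np.diag([1.0, 1.0, 0.0])      # intdim = rank = 2
identity = np.eye(3)                      # intdim = dim = 3
decaying = np.diag([1.0, 0.5, 0.25])      # intdim = 1.75

# Not monotone: bumped dominates the identity in the semidefinite order,
# yet its intrinsic dimension drops toward one.
bumped = identity + 100.0 * np.outer(np.eye(3)[0], np.eye(3)[0])

for name, A in [("rank one", rank_one), ("projector", projector),
                ("identity", identity), ("decaying", decaying), ("bumped", bumped)]:
    print(name, intdim(A))
```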

7.2 Matrix Chernoff with Intrinsic Dimension 

Let us present an extension of the matrix Chernoff inequality. This result controls the maximum 
eigenvalue of a sum of random, positive-semidefinite matrices in terms of the intrinsic dimen¬ 
sion of the expectation of the sum. 

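Theorem 7.2.1 (Matrix Chernoff: Intrinsic Dimension). Consider a finite sequence $\{X_k\}$ of random,
Hermitian matrices of the same size, and assume that

\[ 0 \le \lambda_{\min}(X_k) \quad \text{and} \quad \lambda_{\max}(X_k) \le L \quad \text{for each index } k. \]

Introduce the random matrix

\[ Y = \sum\nolimits_k X_k. \]

Suppose that we have a semidefinite upper bound $M$ for the expectation $\mathbb{E} Y$:

\[ M \succcurlyeq \mathbb{E} Y = \sum\nolimits_k \mathbb{E} X_k. \]

Define an intrinsic dimension bound and a mean bound:

\[ d = \operatorname{intdim}(M) \quad \text{and} \quad \mu_{\max} = \lambda_{\max}(M). \]

Then, for $\theta > 0$,

\[ \mathbb{E} \lambda_{\max}(Y) \le \frac{e^{\theta} - 1}{\theta} \cdot \mu_{\max} + \frac{1}{\theta} \cdot L \log(2d). \tag{7.2.1} \]

Furthermore,

\[ \mathbb{P}\{ \lambda_{\max}(Y) \ge (1 + \varepsilon) \mu_{\max} \} \le 2d \cdot \left[ \frac{e^{\varepsilon}}{(1 + \varepsilon)^{1 + \varepsilon}} \right]^{\mu_{\max}/L} \quad \text{for } \varepsilon \ge L/\mu_{\max}. \tag{7.2.2} \]

The proof of this result appears below in §7.6.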


7.2.1 Discussion 

Theorem 7.2.1 is almost identical with the parts of the basic matrix Chernoff inequality that
concern the maximum eigenvalue $\lambda_{\max}(Y)$. Let us call attention to the differences. The key
advantage is that the current result depends on the intrinsic dimension of the matrix $M$ instead of the
ambient dimension. When the eigenvalues of $M$ decay, the improvement can be dramatic. We
do suffer a small cost in the extra factor of two, and the tail bound is restricted to a smaller range
of the parameter $\varepsilon$. Neither of these limitations is particularly significant.

We have chosen to frame the result in terms of the upper bound $M$ because it can be
challenging to calculate the mean $\mathbb{E} Y$ exactly. The statement here allows us to draw conclusions
directly from the upper bound $M$. These estimates do not follow formally from a result stated for
$\mathbb{E} Y$ because the intrinsic dimension is not monotone with respect to the semidefinite order.

A shortcoming of Theorem 7.2.1 is that it does not provide any information about $\lambda_{\min}(Y)$.
Curiously, the approach we use to prove the result just does not work for the minimum
eigenvalue.


7.2.2 Example: A Random Column Submatrix 

To demonstrate the value of Theorem 7.2.1, let us return to one of the problems we studied in 
§5.2. We can now develop a refined estimate for the expected norm of a random column
submatrix drawn from a fixed matrix.

In this example, we consider a fixed $m \times n$ matrix $B$, and we let $\{\delta_k\}$ be an independent family
of BERNOULLI($p/n$) random variables. We form the random submatrix

\[ Z = \sum_{k=1}^{n} \delta_k \, b_{:k} e_k^* \]

where $b_{:k}$ is the $k$th column of $B$ and $e_k$ denotes the $k$th standard basis vector. This random
submatrix contains an average of $p$ nonzero columns from $B$. To study the norm of $Z$, we consider
the positive-semidefinite random matrix

\[ Y = ZZ^* = \sum_{k=1}^{n} \delta_k \, b_{:k} b_{:k}^*. \]

This time, we invoke Theorem 7.2.1 to obtain a new estimate for the maximum eigenvalue of $Y$.

We need a semidefinite bound $M$ for the mean $\mathbb{E} Y$ of the random matrix. In this case, the
exact value is available:

\[ M = \mathbb{E} Y = \frac{p}{n} BB^*. \]

We can easily calculate the intrinsic dimension of this matrix:

\[ d = \operatorname{intdim}(M) = \operatorname{intdim}\Big( \frac{p}{n} BB^* \Big) = \operatorname{intdim}(BB^*) = \frac{\operatorname{tr}(BB^*)}{\|BB^*\|} = \frac{\|B\|_F^2}{\|B\|^2} = \operatorname{srank}(B). \]

The second identity holds because the intrinsic dimension is scale invariant. The last relation is
simply the definition (2.1.25) of the stable rank. The maximum eigenvalue of $M$ verifies

\[ \lambda_{\max}(M) = \frac{p}{n} \lambda_{\max}(BB^*) = \frac{p}{n} \|B\|^2. \]

The maximum norm $L$ of any term in the sum $Y$ satisfies $L = \max_k \|b_{:k}\|^2$.

We may now apply the intrinsic Chernoff inequality. The expectation bound (7.2.1) with $\theta = 1$
delivers

\[ \mathbb{E} \|Z\|^2 = \mathbb{E} \lambda_{\max}(Y) \le 1.72 \cdot \frac{p}{n} \cdot \|B\|^2 + \log(2 \operatorname{srank}(B)) \cdot \max\nolimits_k \|b_{:k}\|^2. \]

In the earlier analysis, we obtained a similar bound (5.2.1). The new result depends on the
logarithm of the stable rank instead of $\log m$, the logarithm of the number of rows of $B$. When the
stable rank of $B$ is small, meaning that many rows are almost collinear, the revised
estimate can result in a substantial improvement.
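
The bound can be compared with a simulation. The sketch below assumes NumPy; the test matrix (a rank-one matrix plus small noise, so that the stable rank is close to one), the dimensions, the seed, and the number of trials are all hypothetical choices made here. It evaluates the right-hand side of the expectation bound and estimates $\mathbb{E} \|Z\|^2$ by Monte Carlo.

```python
import numpy as np

def stable_rank(B):
    """srank(B) = ||B||_F^2 / ||B||^2."""
    return np.linalg.norm(B, "fro") ** 2 / np.linalg.norm(B, 2) ** 2

def intrinsic_chernoff_bound(B, p):
    """1.72 (p/n) ||B||^2 + log(2 srank(B)) max_k ||b_k||^2, i.e. (7.2.1) with theta = 1."""
    n = B.shape[1]
    L = np.max(np.sum(np.abs(B) ** 2, axis=0))       # largest squared column norm
    return 1.72 * (p / n) * np.linalg.norm(B, 2) ** 2 + np.log(2 * stable_rank(B)) * L

rng = np.random.default_rng(2)
m, n, p = 40, 200, 20                                # hypothetical problem size
B = np.outer(rng.normal(size=m), rng.normal(size=n)) + 0.01 * rng.normal(size=(m, n))

trials = []
for _ in range(200):
    delta = rng.random(n) < p / n                    # Bernoulli(p/n) column selectors
    Z = B * delta                                    # zero out the unselected columns
    trials.append(np.linalg.norm(Z, 2) ** 2)
empirical = np.mean(trials)
bound = intrinsic_chernoff_bound(B, p)
print(stable_rank(B), empirical, bound)
```

For this matrix, $\log(2 \operatorname{srank}(B))$ is about $0.69$, whereas $\log m$ is about $3.69$, so the second term of the bound shrinks by roughly a factor of five.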

7.3 Matrix Bernstein with Intrinsic Dimension 

Next, we present an extension of the matrix Bernstein inequality. These results provide tail 
bounds for an independent sum of bounded random matrices that depend on the intrinsic
dimension of the variance. This theorem is essentially due to Stanislav Minsker.

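Theorem 7.3.1 (Intrinsic Matrix Bernstein). Consider a finite sequence $\{S_k\}$ of random complex
matrices with the same size, and assume that

\[ \mathbb{E} S_k = 0 \quad \text{and} \quad \|S_k\| \le L. \]

Introduce the random matrix

\[ Z = \sum\nolimits_k S_k. \]

Let $V_1$ and $V_2$ be semidefinite upper bounds for the matrix-valued variances $\operatorname{Var}_1(Z)$ and $\operatorname{Var}_2(Z)$:

\begin{align*}
V_1 &\succcurlyeq \operatorname{Var}_1(Z) = \mathbb{E}(ZZ^*) = \sum\nolimits_k \mathbb{E}(S_k S_k^*), \quad \text{and} \\
V_2 &\succcurlyeq \operatorname{Var}_2(Z) = \mathbb{E}(Z^*Z) = \sum\nolimits_k \mathbb{E}(S_k^* S_k).
\end{align*}

Define an intrinsic dimension bound and a variance bound

\[ d = \operatorname{intdim} \begin{bmatrix} V_1 & 0 \\ 0 & V_2 \end{bmatrix} \quad \text{and} \quad v = \max\{ \|V_1\|, \|V_2\| \}. \tag{7.3.1} \]

Then, for $t \ge \sqrt{v} + L/3$,

\[ \mathbb{P}\{ \|Z\| \ge t \} \le 4d \exp\left( \frac{-t^2/2}{v + Lt/3} \right). \tag{7.3.2} \]

The proof of this result appears below in §7.7.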

7.3.1 Discussion

Theorem 7.3.1 is quite similar to Theorem 6.1.1, so we focus on the differences. Although the
statement of Theorem 7.3.1 may seem circumspect, it is important to present the result in terms
of upper bounds $V_1$ and $V_2$ for the matrix-valued variances. Indeed, it can be challenging to
calculate the matrix-valued variances exactly. The fact that the intrinsic dimension is not monotone
interferes with our ability to use a simpler result.

Note that the tail bound (7.3.2) now depends on the intrinsic dimension of the block-diagonal
matrix $\operatorname{diag}(V_1, V_2)$. This intrinsic dimension quantity never exceeds the total of the two side
lengths of the random matrix $Z$. As a consequence, the new tail bound always has a better
dimensional dependence than the earlier result. The costs of this improvement are small: We pay
an extra factor of four in the probability bound, and we must restrict our attention to a more
limited range of the parameter $t$. Neither of these changes is significant.

The result does not contain an explicit estimate for $\mathbb{E} \|Z\|$, but we can obtain such a bound by
integrating the tail inequality (7.3.2). This estimate is similar with the earlier bound (6.1.3), but
it depends on the intrinsic dimension instead of the ambient dimension.

Corollary 7.3.2 (Intrinsic Matrix Bernstein: Expectation Bound). Instate the notation and
hypotheses of Theorem 7.3.1. Then

\[ \mathbb{E} \|Z\| \le \mathrm{Const} \cdot \Big( \sqrt{v \log(1 + d)} + L \log(1 + d) \Big). \tag{7.3.3} \]

See §7.7.4 for the proof. 

Next, let us have a closer look at the intrinsic dimension quantity defined in (7.3.1):

\[ d = \frac{\operatorname{tr} V_1 + \operatorname{tr} V_2}{\max\{ \|V_1\|, \|V_2\| \}}. \]

We can make a further bound on the denominator to obtain an estimate in terms of the intrinsic
dimensions of the two blocks:

\[ \min\{ \operatorname{intdim}(V_1), \operatorname{intdim}(V_2) \} \le d \le \operatorname{intdim}(V_1) + \operatorname{intdim}(V_2). \tag{7.3.4} \]

This bound reflects a curious phenomenon: the intrinsic dimension parameter $d$ is not
necessarily comparable with the larger of $\operatorname{intdim}(V_1)$ or $\operatorname{intdim}(V_2)$.
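
A concrete pair of blocks shows how far $d$ can fall below the larger of the two intrinsic dimensions. The sketch below assumes NumPy, and the two matrices are chosen here only for illustration: $V_1$ is an identity matrix, with intrinsic dimension ten, and $V_2$ is a rank-one matrix with a large norm, with intrinsic dimension one.

```python
import numpy as np

def intdim(A):
    """Intrinsic dimension tr(A) / ||A|| of a nonzero positive-semidefinite matrix."""
    return np.trace(A) / np.linalg.norm(A, 2)

def block_intdim(V1, V2):
    """Intrinsic dimension of diag(V1, V2): (tr V1 + tr V2) / max(||V1||, ||V2||)."""
    return (np.trace(V1) + np.trace(V2)) / max(np.linalg.norm(V1, 2), np.linalg.norm(V2, 2))

V1 = np.eye(10)                 # intdim(V1) = 10
V2 = np.zeros((10, 10))
V2[0, 0] = 100.0                # intdim(V2) = 1, but ||V2|| dominates ||V1||
d = block_intdim(V1, V2)        # (10 + 100) / 100 = 1.1
print(intdim(V1), intdim(V2), d)
```

Here $d = 1.1$ sits between the lower and upper bounds in (7.3.4), far below $\operatorname{intdim}(V_1) = 10$, because the large norm of $V_2$ controls the denominator.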

The other commentary about the original matrix Bernstein inequality, Theorem 6.1.1, also 
applies to the intrinsic dimension result. For example, we can adapt the result to a sum of
uncentered, independent, random, bounded matrices. In addition, the theorem becomes somewhat
simpler for a Hermitian random matrix because there is only one matrix-valued variance to deal 
with. The modifications required in these cases are straightforward. 

7.3.2 Example: Matrix Approximation by Random Sampling 

We can apply the intrinsic Bernstein inequality to study the behavior of randomized methods for 
matrix approximation. The following result is an immediate consequence of Theorem 7.3.1 and 
Corollary 7.3.2. 

Corollary 7.3.3 (Matrix Approximation by Random Sampling: Intrinsic Dimension Bounds). Let
$B$ be a fixed $d_1 \times d_2$ matrix. Construct a $d_1 \times d_2$ random matrix $R$ that satisfies

$$\mathbb{E} R = B \quad\text{and}\quad \|R\| \leq L.$$

Let $M_1$ and $M_2$ be semidefinite upper bounds for the expected squares:

$$M_1 \succcurlyeq \mathbb{E}(R R^*) \quad\text{and}\quad M_2 \succcurlyeq \mathbb{E}(R^* R).$$

Define the quantities

$$d = \operatorname{intdim}\begin{bmatrix} M_1 & 0 \\ 0 & M_2 \end{bmatrix} \quad\text{and}\quad m = \max\{\|M_1\|, \|M_2\|\}.$$

Form the matrix sampling estimator

$$\bar{R}_n = \frac{1}{n} \sum_{k=1}^{n} R_k \quad\text{where each $R_k$ is an independent copy of $R$.}$$

Then the estimator satisfies

$$\mathbb{E}\|\bar{R}_n - B\| \leq \mathrm{Const} \cdot \left[ \sqrt{\frac{m \log(1 + d)}{n}} + \frac{L \log(1 + d)}{n} \right]. \tag{7.3.5}$$

Furthermore, for all $t \geq \sqrt{m} + L/3$,

$$\mathbb{P}\{\|\bar{R}_n - B\| \geq t\} \leq 4d \exp\left( \frac{-n t^2/2}{m + 2Lt/3} \right). \tag{7.3.6}$$

The proof is similar to that of Corollary 6.2.1, so we omit the details.

7.3.3 Application: Randomized Matrix Multiplication 

We will apply Corollary 7.3.3 to study the randomized matrix multiplication algorithm from §6.4. 
This method results in a small, but very appealing, improvement in the number of samples that 
are required. This argument is essentially due to Tassos Zouzias [Zou13].

Our goal is to approximate the product of a $d_1 \times N$ matrix $B$ and an $N \times d_2$ matrix $C$. We
assume that both matrices $B$ and $C$ have unit spectral norm. The results are stated in terms of
the average stable rank

$$\mathrm{asr} = \tfrac{1}{2} \left( \operatorname{srank}(B) + \operatorname{srank}(C) \right).$$

The stable rank was introduced in (2.1.25). To approximate the product $BC$, we constructed a
simple random matrix $R$ whose mean $\mathbb{E} R = BC$, and then we formed the estimator

$$\bar{R}_n = \frac{1}{n} \sum_{k=1}^{n} R_k \quad\text{where each $R_k$ is an independent copy of $R$.}$$

The challenge is to bound the error $\|\bar{R}_n - BC\|$.

To do so, let us refer back to our calculations from §6.4. We find that

$$\|R\| \leq \mathrm{asr}, \qquad \mathbb{E}(R R^*) \preccurlyeq 2 \cdot \mathrm{asr} \cdot B B^*, \qquad\text{and}\qquad \mathbb{E}(R^* R) \preccurlyeq 2 \cdot \mathrm{asr} \cdot C^* C.$$

Starting from this point, we can quickly improve on our earlier analysis by incorporating the
intrinsic dimension bounds.

It is natural to set $M_1 = 2 \cdot \mathrm{asr} \cdot B B^*$ and $M_2 = 2 \cdot \mathrm{asr} \cdot C^* C$. We may now bound the intrinsic
dimension parameter

$$\begin{aligned}
d = \operatorname{intdim}\begin{bmatrix} M_1 & 0 \\ 0 & M_2 \end{bmatrix}
&\leq \operatorname{intdim}(M_1) + \operatorname{intdim}(M_2) \\
&= \frac{\operatorname{tr}(B B^*)}{\|B B^*\|} + \frac{\operatorname{tr}(C^* C)}{\|C^* C\|}
= \frac{\|B\|_F^2}{\|B\|^2} + \frac{\|C\|_F^2}{\|C\|^2} \\
&= \operatorname{srank}(B) + \operatorname{srank}(C) = 2 \cdot \mathrm{asr}.
\end{aligned}$$

The first inequality follows from (7.3.4), and the second relation is Definition 7.1.1 of the intrinsic
dimension. The third relation depends on the norm identities (2.1.8) and (2.1.24). Finally, we identify
the stable ranks of $B$ and $C$ and the average stable rank. The calculation of the quantity $m$
proceeds from the same considerations as in §6.4. Thus,

$$m = \max\{\|M_1\|, \|M_2\|\} = 2 \cdot \mathrm{asr}.$$

This is all the information we need to collect. 

Corollary 7.3.3 now implies that

$$\mathbb{E}\|\bar{R}_n - BC\| \leq \mathrm{Const} \cdot \left[ \sqrt{\frac{\mathrm{asr} \cdot \log(1 + \mathrm{asr})}{n}} + \frac{\mathrm{asr} \cdot \log(1 + \mathrm{asr})}{n} \right].$$

In other words, if the number $n$ of samples satisfies

$$n \geq \varepsilon^{-2} \cdot \mathrm{asr} \cdot \log(1 + \mathrm{asr}),$$

then the error satisfies

$$\mathbb{E}\|\bar{R}_n - BC\| \leq \mathrm{Const} \cdot (\varepsilon + \varepsilon^2).$$

In the original analysis from §6.4, our estimate for the number $n$ of samples contained the term
$\log(d_1 + d_2)$ instead of $\log(1 + \mathrm{asr})$. We have replaced the dependence on the ambient
dimension of the product $BC$ by a measure of the stable rank of the two factors. When the average
stable rank is small in comparison with the dimension of the product, the analysis based on the
intrinsic dimension offers an improvement in the bound on the number of samples required to
approximate the product.
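
As a rough illustration of this comparison, the sketch below evaluates the two sample-count estimates side by side. It ignores the unspecified constant, so the numbers indicate only how the logarithmic factor changes. The function names and the values of $\varepsilon$, asr, $d_1$ and $d_2$ are hypothetical.

```python
import math


def samples_intrinsic(eps, asr):
    """Smallest integer n with n >= eps^-2 * asr * log(1 + asr) (constant ignored)."""
    return math.ceil(asr * math.log(1.0 + asr) / eps ** 2)


def samples_ambient(eps, asr, d1, d2):
    """The same estimate with log(d1 + d2) in place of log(1 + asr)."""
    return math.ceil(asr * math.log(d1 + d2) / eps ** 2)


# Hypothetical setting: low average stable rank, large ambient dimensions.
eps, asr, d1, d2 = 0.1, 5.0, 1000, 1000
n_intrinsic = samples_intrinsic(eps, asr)
n_ambient = samples_ambient(eps, asr, d1, d2)
print(n_intrinsic, n_ambient)
```

With these inputs the intrinsic estimate is smaller by roughly the ratio $\log(d_1 + d_2) / \log(1 + \mathrm{asr})$, which is the improvement described above.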

7.4 Revisiting the Matrix Laplace Transform Bound 

Let us proceed with the proofs of the matrix concentration inequalities based on intrinsic
dimension. The challenge is to identify and remedy the weak points in the arguments from Chapter 3.

After some reflection, we can trace the dependence on the ambient dimension in our earlier
results to the proof of Proposition 3.2.1. In the original argument, we used an exponential
function to transform the tail event before applying Markov’s inequality. This approach leads
to trouble for the simple reason that the exponential function does not pass through the origin,
which gives undue weight to eigenvalues that are close to zero.

We can resolve this problem by using other types of maps to transform the tail event. The
functions we have in mind are adjusted versions of the exponential. In particular, for fixed $\theta > 0$,
we can consider

$$\psi_1(t) = \max\{0, e^{\theta t} - 1\} \quad\text{and}\quad \psi_2(t) = e^{\theta t} - \theta t - 1.$$

Both functions are nonnegative and convex, and they are nondecreasing on the positive real line.
In each case, $\psi_i(0) = 0$. At the same time, the presence of the exponential function allows us to
exploit our bounds for the trace mgf.

Proposition 7.4.1 (Generalized Matrix Laplace Transform Bound). Let $Y$ be a random Hermitian
matrix. Let $\psi : \mathbb{R} \to \mathbb{R}_+$ be a nonnegative function that is nondecreasing on $[0, \infty)$. For each $t \geq 0$,

$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq \frac{1}{\psi(t)} \, \mathbb{E} \operatorname{tr} \psi(Y).$$

Proof. The proof follows the same lines as the proof of Proposition 3.2.1, but it requires some
additional finesse. Since $\psi$ is nondecreasing on $[0, \infty)$, the bound $a \geq t$ implies that $\psi(a) \geq \psi(t)$.
As a consequence,

$$\lambda_{\max}(Y) \geq t \quad\text{implies}\quad \lambda_{\max}(\psi(Y)) \geq \psi(t).$$

Indeed, on the tail event $\lambda_{\max}(Y) \geq t$, we must have $\psi(\lambda_{\max}(Y)) \geq \psi(t)$. The Spectral Mapping
Theorem, Proposition 2.1.3, indicates that the number $\psi(\lambda_{\max}(Y))$ is one of the eigenvalues of
the matrix $\psi(Y)$, so we determine that $\lambda_{\max}(\psi(Y))$ also exceeds $\psi(t)$.

Returning to the tail probability, we discover that

$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq \mathbb{P}\{\lambda_{\max}(\psi(Y)) \geq \psi(t)\} \leq \frac{1}{\psi(t)} \, \mathbb{E} \lambda_{\max}(\psi(Y)).$$

The second bound is Markov’s inequality (2.2.1), which is valid because $\psi$ is nonnegative. Finally,

$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq \frac{1}{\psi(t)} \, \mathbb{E} \operatorname{tr} \psi(Y).$$

The inequality holds because of the fact (2.1.13) that the trace of $\psi(Y)$, a positive-semidefinite
matrix, must be at least as large as its maximum eigenvalue. □
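
A small exact computation can make the statement concrete. The sketch below assumes a random diagonal matrix with finitely many equally likely outcomes, so that $\lambda_{\max}$ is the largest diagonal entry and $\operatorname{tr} \psi(Y)$ is a sum of scalar values. The distribution, the function names, and the choice $\psi = \psi_1$ are illustrative assumptions.

```python
import math


def psi1(t, theta):
    """psi_1(t) = max{0, exp(theta * t) - 1}."""
    return max(0.0, math.exp(theta * t) - 1.0)


def tail_and_bound(outcomes, t, theta):
    """Exact tail probability and Laplace-transform bound for a random diagonal matrix.

    `outcomes` lists the equally likely diagonals of Y. For a diagonal matrix,
    lambda_max is the largest entry and tr psi(Y) is the sum of psi over the entries.
    """
    n = len(outcomes)
    tail = sum(1 for diag in outcomes if max(diag) >= t) / n
    mean_trace = sum(sum(psi1(y, theta) for y in diag) for diag in outcomes) / n
    return tail, mean_trace / psi1(t, theta)


# Hypothetical distribution: four equally likely 2 x 2 diagonal matrices.
outcomes = [(-1.0, 0.5), (0.0, 2.0), (1.0, 1.0), (0.5, -2.0)]
tail, bound = tail_and_bound(outcomes, t=2.0, theta=1.0)
print(tail, bound)
```

The level $t$ must be strictly positive here because $\psi_1(0) = 0$. Negative eigenvalues contribute nothing to the bound, which is the effect that the adjusted exponential is designed to achieve.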

7.5 The Intrinsic Dimension Lemma 

The other new ingredient is a simple observation that allows us to control a trace function
applied to a positive-semidefinite matrix in terms of the intrinsic dimension of the matrix.

Lemma 7.5.1 (Intrinsic Dimension). Let $\varphi$ be a convex function on the interval $[0, \infty)$, and assume
that $\varphi(0) = 0$. For any positive-semidefinite matrix $A$, it holds that

$$\operatorname{tr} \varphi(A) \leq \operatorname{intdim}(A) \cdot \varphi(\|A\|).$$

Proof. Since the function $a \mapsto \varphi(a)$ is convex on the interval $[0, L]$, it is bounded above by the
chord connecting the graph at the endpoints. That is, for $a \in [0, L]$,

$$\varphi(a) \leq \left( 1 - \frac{a}{L} \right) \cdot \varphi(0) + \frac{a}{L} \cdot \varphi(L) = \frac{a}{L} \cdot \varphi(L).$$

The eigenvalues of $A$ fall in the interval $[0, L]$, where $L = \|A\|$. As an immediate consequence of
the Transfer Rule (2.1.14), we find that

$$\operatorname{tr} \varphi(A) \leq \frac{\operatorname{tr} A}{\|A\|} \cdot \varphi(\|A\|).$$

Identify the intrinsic dimension of A to complete the argument. □
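
The lemma can be checked directly on a spectrum. This sketch assumes the matrix $A$ is given by its eigenvalues and uses $\varphi(a) = e^a - 1$, the function that appears in the proofs that follow. The spectrum is a hypothetical example with one dominant eigenvalue.

```python
import math


def intdim(eigs):
    """Intrinsic dimension tr(A) / ||A|| of a PSD matrix, given its eigenvalues."""
    return sum(eigs) / max(eigs)


def trace_phi(eigs, phi):
    """tr phi(A) computed as the sum of phi over the eigenvalues of A."""
    return sum(phi(x) for x in eigs)


phi = math.expm1  # phi(a) = e^a - 1 is convex with phi(0) = 0

# Hypothetical spectrum of a 4 x 4 PSD matrix with one dominant eigenvalue.
eigs = [3.0, 0.5, 0.1, 0.0]
lhs = trace_phi(eigs, phi)
intrinsic_bound = intdim(eigs) * phi(max(eigs))
ambient_bound = len(eigs) * phi(max(eigs))
print(lhs, intrinsic_bound, ambient_bound)
```

The intrinsic bound uses the factor $\operatorname{intdim}(A) = 1.2$ where a cruder argument would use the ambient dimension 4.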


7.6 Proof of the Intrinsic Chernoff Bound 


With these results at hand, we are prepared to prove our first intrinsic dimension result, which 
extends the matrix Chernoff inequality. 


Proof of Theorem 7.2.1. Consider a finite sequence $\{X_k\}$ of independent, random Hermitian
matrices with

$$0 \leq \lambda_{\min}(X_k) \quad\text{and}\quad \lambda_{\max}(X_k) \leq L \quad\text{for each index $k$.}$$

Introduce the sum

$$Y = \sum\nolimits_k X_k.$$

The challenge is to establish bounds for $\lambda_{\max}(Y)$ that depend on the intrinsic dimension of a
matrix $M$ that satisfies $M \succcurlyeq \mathbb{E} Y$. We begin the argument with the proof of the tail bound (7.2.2).
Afterward, we show how to extract the expectation bound (7.2.1).

Fix a number $\theta > 0$, and define the function $\psi(t) = \max\{0, e^{\theta t} - 1\}$ for $t \in \mathbb{R}$. For $t > 0$, the
general version of the matrix Laplace transform bound, Proposition 7.4.1, states that

$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq \frac{1}{\psi(t)} \, \mathbb{E} \operatorname{tr} \psi(Y) = \frac{1}{e^{\theta t} - 1} \, \mathbb{E} \operatorname{tr}\left( e^{\theta Y} - I \right). \tag{7.6.1}$$

We have exploited the fact that $Y$ is positive semidefinite and the assumption that $t > 0$. The
presence of the identity matrix on the right-hand side allows us to draw stronger conclusions
than we could before.

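Let us study the expected trace term on the right-hand side of (7.6.1). As in the proof of the
original matrix Chernoff bound, Theorem 5.1.1, we have the estimate

$$\mathbb{E} \operatorname{tr} e^{\theta Y} \leq \operatorname{tr} \exp\left( g(\theta) \cdot \mathbb{E} Y \right) \quad\text{where}\quad g(\theta) = \frac{e^{\theta L} - 1}{L}.$$

Introduce the function $\varphi(a) = e^a - 1$, and observe that

$$\mathbb{E} \operatorname{tr}\left( e^{\theta Y} - I \right) \leq \operatorname{tr}\left( e^{g(\theta) \cdot \mathbb{E} Y} - I \right) \leq \operatorname{tr}\left( e^{g(\theta) \cdot M} - I \right) = \operatorname{tr} \varphi(g(\theta) \cdot M).$$

The second inequality follows from the assumption that $\mathbb{E} Y \preccurlyeq M$ and the monotonicity (2.1.16)
of the trace exponential. Now, apply the intrinsic dimension bound, Lemma 7.5.1, to reach

$$\mathbb{E} \operatorname{tr}\left( e^{\theta Y} - I \right) \leq \operatorname{intdim}(M) \cdot \varphi(g(\theta) \cdot \|M\|).$$

We have used the fact that the intrinsic dimension does not depend on the scaling factor $g(\theta)$.
Recalling the notation $d = \operatorname{intdim}(M)$ and $\mu_{\max} = \|M\|$, we continue the calculation:

$$\mathbb{E} \operatorname{tr}\left( e^{\theta Y} - I \right) \leq d \cdot \varphi(g(\theta) \cdot \mu_{\max}) \leq d \cdot e^{g(\theta) \cdot \mu_{\max}}. \tag{7.6.2}$$

We have invoked the trivial inequality $\varphi(a) \leq e^a$, which holds for $a \in \mathbb{R}$.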

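Next, introduce the bound (7.6.2) on the expected trace into the probability bound (7.6.1) to
obtain

$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq d \cdot \frac{e^{\theta t}}{e^{\theta t} - 1} \cdot e^{-\theta t + g(\theta) \cdot \mu_{\max}} \leq d \cdot \left( 1 + \frac{1}{\theta t} \right) \cdot e^{-\theta t + g(\theta) \cdot \mu_{\max}}. \tag{7.6.3}$$

To control the fraction, we have observed that

$$\frac{e^a}{e^a - 1} = 1 + \frac{1}{e^a - 1} \leq 1 + \frac{1}{a} \quad\text{for $a > 0$.}$$

We obtain the latter inequality by replacing the convex function $a \mapsto e^a - 1$ with its tangent line
at $a = 0$.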

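In the estimate (7.6.3), we make the change of variables $t \mapsto (1 + \varepsilon) \mu_{\max}$. The bound is valid
for all $\theta > 0$, so we can select $\theta = L^{-1} \log(1 + \varepsilon)$ to minimize the exponential. Altogether, these
steps lead to the estimate

$$\mathbb{P}\{\lambda_{\max}(Y) \geq (1 + \varepsilon) \mu_{\max}\} \leq d \cdot \left( 1 + \frac{L/\mu_{\max}}{(1 + \varepsilon) \log(1 + \varepsilon)} \right) \cdot \left[ \frac{e^{\varepsilon}}{(1 + \varepsilon)^{1 + \varepsilon}} \right]^{\mu_{\max}/L}. \tag{7.6.4}$$

Now, instate the assumption that $\varepsilon \geq L/\mu_{\max}$. The function $a \mapsto (1 + a) \log(1 + a)$ is convex when
$a > -1$, so we can bound it below using its tangent at $\varepsilon = 0$. Thus,

$$(1 + \varepsilon) \log(1 + \varepsilon) \geq \varepsilon \geq \frac{L}{\mu_{\max}}.$$

It follows that the parenthesis in (7.6.4) is bounded by two, which yields the conclusion (7.2.2).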
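
The two elementary estimates used in this part of the proof are easy to sanity-check numerically. The sketch below evaluates the gap in the fraction bound on a grid and the parenthesis from (7.6.4) for several values of the ratio $L/\mu_{\max}$ with $\varepsilon$ at or above that ratio. The grid and the sample ratios are arbitrary illustrative choices, and a grid check is of course not a proof.

```python
import math


def fraction_gap(a):
    """(1 + 1/a) - e^a / (e^a - 1), which should be nonnegative for a > 0."""
    return (1.0 + 1.0 / a) - math.exp(a) / math.expm1(a)


def parenthesis(eps, ratio):
    """The factor 1 + (L/mu_max) / ((1 + eps) log(1 + eps)), with ratio = L/mu_max."""
    return 1.0 + ratio / ((1.0 + eps) * math.log1p(eps))


grid = [0.01 * k for k in range(1, 2001)]  # a in (0, 20]
min_gap = min(fraction_gap(a) for a in grid)

# Hypothetical ratios L/mu_max; each eps respects the assumption eps >= L/mu_max.
max_paren = max(parenthesis(eps, ratio)
                for ratio in (0.01, 0.1, 1.0, 5.0)
                for eps in (ratio, 2.0 * ratio, 10.0 * ratio))
print(min_gap, max_paren)
```

The parenthesis approaches two as $\varepsilon = L/\mu_{\max}$ tends to zero, so the constant in the bound cannot be lowered by this argument alone.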

Now, we turn to the expectation bound (7.2.1). Observe that the functional inverse of $\psi$ is the
increasing concave function

$$\psi^{-1}(u) = \frac{1}{\theta} \log(1 + u) \quad\text{for $u \geq 0$.}$$

Since $Y$ is a positive-semidefinite matrix, we can calculate that

$$\begin{aligned}
\mathbb{E} \lambda_{\max}(Y) = \mathbb{E} \psi^{-1}(\psi(\lambda_{\max}(Y)))
&\leq \psi^{-1}(\mathbb{E} \psi(\lambda_{\max}(Y))) \\
&= \psi^{-1}(\mathbb{E} \lambda_{\max}(\psi(Y))) \leq \psi^{-1}(\mathbb{E} \operatorname{tr} \psi(Y)).
\end{aligned} \tag{7.6.5}$$

The second relation is Jensen’s inequality (2.2.2), which is valid because $\psi^{-1}$ is concave. The third
relation follows from the Spectral Mapping Theorem, Proposition 2.1.3, because the function $\psi$
is increasing. We can bound the maximum eigenvalue by the trace because $\psi(Y)$ is positive
semidefinite and $\psi^{-1}$ is an increasing function.

Now, substitute the bound (7.6.2) into the last display (7.6.5) to reach

$$\begin{aligned}
\mathbb{E} \lambda_{\max}(Y) &\leq \psi^{-1}\left( d \cdot \exp(g(\theta) \cdot \mu_{\max}) \right) = \frac{1}{\theta} \log\left( 1 + d \cdot e^{g(\theta) \cdot \mu_{\max}} \right) \\
&\leq \frac{1}{\theta} \log\left( 2d \cdot e^{g(\theta) \cdot \mu_{\max}} \right) = \frac{1}{\theta} \left( \log(2d) + g(\theta) \cdot \mu_{\max} \right).
\end{aligned}$$

The first inequality again requires the property that $\psi^{-1}$ is increasing. The second inequality
follows because $1 \leq d \cdot e^{g(\theta) \cdot \mu_{\max}}$, which is a consequence of the fact that the intrinsic dimension
exceeds one and the exponent is nonnegative. To complete the argument, introduce the
definition of $g(\theta)$, and make the change of variables $\theta \mapsto \theta/L$. These steps yield (7.2.1). □

7.7 Proof of the Intrinsic Bernstein Bounds

In this section, we present the arguments that lead up to the intrinsic Bernstein bounds. That is, 
we develop tail inequalities for an independent sum of bounded random matrices that depend 
on the intrinsic dimension of the variance. 


7.7.1 The Hermitian Case 

As usual, Hermitian matrices provide the natural setting for matrix concentration. We begin with 
an explicit statement and proof of a bound for the Hermitian case. 

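Theorem 7.7.1 (Matrix Bernstein: Hermitian Case with Intrinsic Dimension). Consider a finite
sequence $\{X_k\}$ of independent, random Hermitian matrices of the same size, and assume that

$$\mathbb{E} X_k = 0 \quad\text{and}\quad \lambda_{\max}(X_k) \leq L \quad\text{for each index $k$.}$$

Introduce the random matrix

$$Y = \sum\nolimits_k X_k.$$

Let $V$ be a semidefinite upper bound for the matrix-valued variance $\operatorname{Var}(Y)$:

$$V \succcurlyeq \operatorname{Var}(Y) = \mathbb{E} Y^2 = \sum\nolimits_k \mathbb{E} X_k^2.$$

Define the intrinsic dimension bound and variance bound

$$d = \operatorname{intdim}(V) \quad\text{and}\quad v = \|V\|.$$

Then, for $t \geq \sqrt{v} + L/3$,

$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq 4d \cdot \exp\left( \frac{-t^2/2}{v + Lt/3} \right). \tag{7.7.1}$$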

The proof of this result appears in the next section. 

7.7.2 Proof of the Hermitian Case 

We commence with the results for an independent sum of random Hermitian matrices whose 
eigenvalues are subject to an upper bound. 

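Proof of Theorem 7.7.1. Consider a finite sequence $\{X_k\}$ of independent, random, Hermitian
matrices with

$$\mathbb{E} X_k = 0 \quad\text{and}\quad \lambda_{\max}(X_k) \leq L \quad\text{for each index $k$.}$$

Introduce the random matrix

$$Y = \sum\nolimits_k X_k.$$

Our goal is to obtain a tail bound for $\lambda_{\max}(Y)$ that reflects the intrinsic dimension of a matrix $V$
that satisfies $V \succcurlyeq \operatorname{Var}(Y)$.

Fix a number $\theta > 0$, and define the function $\psi(t) = e^{\theta t} - \theta t - 1$ for $t \in \mathbb{R}$. The general version
of the matrix Laplace transform bound, Proposition 7.4.1, implies that

$$\begin{aligned}
\mathbb{P}\{\lambda_{\max}(Y) \geq t\} &\leq \frac{1}{\psi(t)} \, \mathbb{E} \operatorname{tr} \psi(Y) \\
&= \frac{1}{e^{\theta t} - \theta t - 1} \, \mathbb{E} \operatorname{tr}\left( e^{\theta Y} - \theta Y - I \right) \\
&= \frac{1}{e^{\theta t} - \theta t - 1} \, \mathbb{E} \operatorname{tr}\left( e^{\theta Y} - I \right).
\end{aligned} \tag{7.7.2}$$

The last identity holds because the random matrix $Y$ has zero mean.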

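Let us focus on the expected trace on the right-hand side of (7.7.2). Examining the proof of
the original matrix Bernstein bound for Hermitian matrices, Theorem 6.6.1, we see that

$$\mathbb{E} \operatorname{tr} e^{\theta Y} \leq \operatorname{tr} \exp\left( g(\theta) \cdot \mathbb{E} Y^2 \right) \quad\text{where}\quad g(\theta) = \frac{\theta^2/2}{1 - \theta L/3}.$$

Introduce the function $\varphi(a) = e^a - 1$, and observe that

$$\mathbb{E} \operatorname{tr}\left( e^{\theta Y} - I \right) \leq \operatorname{tr}\left( e^{g(\theta) \cdot \mathbb{E} Y^2} - I \right) \leq \operatorname{tr}\left( e^{g(\theta) \cdot V} - I \right) = \operatorname{tr} \varphi(g(\theta) \cdot V).$$

The second inequality depends on the assumption that $\mathbb{E} Y^2 = \operatorname{Var}(Y) \preccurlyeq V$ and the monotonicity
property (2.1.16) of the trace exponential. Apply the intrinsic dimension bound, Lemma 7.5.1, to
reach

$$\mathbb{E} \operatorname{tr}\left( e^{\theta Y} - I \right) \leq \operatorname{intdim}(V) \cdot \varphi(g(\theta) \cdot \|V\|) = d \cdot \varphi(g(\theta) \cdot v) \leq d \cdot e^{g(\theta) \cdot v}. \tag{7.7.3}$$

We have used the fact that the intrinsic dimension is scale invariant. Then we identified
$d = \operatorname{intdim}(V)$ and $v = \|V\|$. The last inequality depends on the trivial estimate $\varphi(a) \leq e^a$, valid for all
$a \in \mathbb{R}$.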

Substitute the bound (7.7.3) into the probability inequality (7.7.2) to discover that

$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq d \cdot \frac{e^{\theta t}}{e^{\theta t} - \theta t - 1} \cdot e^{-\theta t + g(\theta) \cdot v} \leq d \cdot \left( 1 + \frac{3}{\theta^2 t^2} \right) \cdot e^{-\theta t + g(\theta) \cdot v}. \tag{7.7.4}$$

This estimate holds for any positive value of $\theta$. To control the fraction, we have observed that

$$\frac{e^a}{e^a - a - 1} = 1 + \frac{1 + a}{e^a - a - 1} \leq 1 + \frac{3}{a^2} \quad\text{for all $a > 0$.}$$

The inequality above is a consequence of the numerical fact

$$\frac{e^a - a - 1}{a^2} - \frac{1 + a}{3} > 0 \quad\text{for all $a \in \mathbb{R}$.}$$

Indeed, the left-hand side of the latter expression defines a convex function of $a$, whose minimal
value, attained near $a \approx 1.30$, is strictly positive.
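
The numerical fact can be examined on a grid. The sketch below locates the approximate minimizer of the left-hand side over $[-5, 5]$ and confirms the resulting bound on the fraction for positive arguments. The grid range and spacing are arbitrary assumptions, and the check complements the convexity argument rather than replacing it.

```python
import math


def gap(a):
    """Left-hand side of the numerical fact: (e^a - a - 1)/a^2 - (1 + a)/3."""
    return (math.expm1(a) - a) / (a * a) - (1.0 + a) / 3.0


def fraction(a):
    """The ratio e^a / (e^a - a - 1) that appears in (7.7.4)."""
    return math.exp(a) / (math.expm1(a) - a)


# Grid on [-5, 5] with spacing 0.01, skipping a = 0 where the expressions are 0/0.
grid = [k / 100.0 for k in range(-500, 501) if k != 0]
a_star = min(grid, key=gap)
print(a_star, gap(a_star))
```

On this grid the minimizer sits at about 1.30 and the minimal value is roughly 0.04, in line with the remark above.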

In the tail bound (7.7.4), we select $\theta = t/(v + Lt/3)$ to reach

$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq d \cdot \left( 1 + 3 \cdot \frac{(v + Lt/3)^2}{t^4} \right) \cdot \exp\left( \frac{-t^2/2}{v + Lt/3} \right).$$

This probability inequality is typically vacuous when $t^2 < v + Lt/3$, so we may as well limit our
attention to the case where $t^2 \geq v + Lt/3$. Under this assumption, the parenthesis is bounded
by four, which gives the tail bound (7.7.1). We can simplify the restriction on $t$ by solving the
quadratic inequality to obtain the sufficient condition

$$t \geq \frac{1}{2} \left[ \frac{L}{3} + \sqrt{\frac{L^2}{9} + 4v} \right].$$

We develop an upper bound for the right-hand side of this inequality as follows.

$$\frac{1}{2} \left[ \frac{L}{3} + \sqrt{\frac{L^2}{9} + 4v} \right] = \frac{L}{6} \left[ 1 + \sqrt{1 + \frac{36v}{L^2}} \right] \leq \frac{L}{6} \left[ 1 + 1 + \frac{6\sqrt{v}}{L} \right] = \sqrt{v} + \frac{L}{3}.$$

We have used the numerical fact $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ for all $a, b \geq 0$. Therefore, the tail bound (7.7.1)
is valid when $t \geq \sqrt{v} + L/3$. □
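
The last steps of the proof can be traced with numbers. The sketch below compares the exact root of the quadratic condition with the simplified threshold $\sqrt{v} + L/3$ and evaluates the prefactor at that threshold. The $(v, L)$ pairs and the function names are hypothetical.

```python
import math


def prefactor(t, v, L):
    """The factor 1 + 3 (v + L t/3)^2 / t^4 that results from theta = t / (v + L t/3)."""
    return 1.0 + 3.0 * (v + L * t / 3.0) ** 2 / t ** 4


def threshold(v, L):
    """The simplified sufficient condition t >= sqrt(v) + L/3."""
    return math.sqrt(v) + L / 3.0


def exact_root(v, L):
    """Positive root of t^2 = v + L t/3, namely (1/2) [L/3 + sqrt(L^2/9 + 4 v)]."""
    return 0.5 * (L / 3.0 + math.sqrt(L * L / 9.0 + 4.0 * v))


# Hypothetical (v, L) pairs covering variance-dominated and range-dominated regimes.
cases = [(1.0, 1.0), (4.0, 0.5), (0.25, 3.0), (100.0, 10.0)]
for v, L in cases:
    t0 = threshold(v, L)
    print(v, L, exact_root(v, L), t0, prefactor(t0, v, L))
```

In each case the exact root lies below the simplified threshold, and the prefactor at the threshold stays below four.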


7.7.3 Proof of the Rectangular Case 

Finally, we present the proof of the intrinsic Bernstein inequality, Theorem 7.3.1, for general 
random matrices. 


Proof of Theorem 7.3.1. Suppose that $\{S_k\}$ is a finite sequence of independent random matrices
that satisfy

$$\mathbb{E} S_k = 0 \quad\text{and}\quad \|S_k\| \leq L \quad\text{for each index $k$.}$$

Form the sum $Z = \sum_k S_k$. As in the proof of Theorem 6.1.1, we derive the result by applying
Theorem 7.7.1 to the Hermitian dilation $Y = \mathscr{H}(Z)$. The only new point that requires attention
is the modification to the intrinsic dimension and variance terms.

Recall the calculation of the variance of the dilation from (2.2.9):

$$\mathbb{E} Y^2 = \mathbb{E} \mathscr{H}(Z)^2 = \begin{bmatrix} \operatorname{Var}_1(Z) & 0 \\ 0 & \operatorname{Var}_2(Z) \end{bmatrix} \preccurlyeq \begin{bmatrix} V_1 & 0 \\ 0 & V_2 \end{bmatrix} = V.$$

The semidefinite inequality follows from our assumptions on $V_1$ and $V_2$. Therefore, the intrinsic
dimension quantity in Theorem 7.7.1 induces the definition in the general case:

$$d = \operatorname{intdim}(V) = \operatorname{intdim}\begin{bmatrix} V_1 & 0 \\ 0 & V_2 \end{bmatrix}.$$

Similarly,

$$v = \|V\| = \max\{\|V_1\|, \|V_2\|\}.$$

This point completes the argument. □


7.7.4 Proof of the Intrinsic Bernstein Expectation Bound 

Finally, let us establish the expectation bound, Corollary 7.3.2, that accompanies Theorem 7.3.1.

Proof of Corollary 7.3.2. Fix a number $\mu \geq \sqrt{v}$. We may rewrite the expectation of $\|Z\|$ as an
integral:

$$\begin{aligned}
\mathbb{E}\|Z\| = \int_0^\infty \mathbb{P}\{\|Z\| \geq t\} \, dt
&\leq \mu + 4d \int_\mu^\infty \exp\left( \frac{-t^2/2}{v + Lt/3} \right) dt \\
&\leq \mu + 4d \int_\mu^\infty e^{-t^2/(2v)} \, dt + 4d \int_\mu^\infty e^{-3t/(2L)} \, dt \\
&\leq \mu + 4d \sqrt{v} \, e^{-\mu^2/(2v)} + \tfrac{8}{3} d L \, e^{-3\mu/(2L)}.
\end{aligned}$$

To obtain the first inequality, we split the integral at $\mu$, and we bound the probability by one on
the domain of integration $[0, \mu]$. The second inequality holds because

$$\exp\left( \frac{-t^2/2}{v + Lt/3} \right) \leq \max\left\{ e^{-t^2/(2v)}, \, e^{-3t/(2L)} \right\} \leq e^{-t^2/(2v)} + e^{-3t/(2L)}.$$

We controlled the Gaussian integral by inserting the factor $\sqrt{v} \, (t/v) \geq 1$ into the integrand:

$$\int_\mu^\infty e^{-t^2/(2v)} \, dt \leq \sqrt{v} \int_\mu^\infty (t/v) \, e^{-t^2/(2v)} \, dt = \sqrt{v} \, e^{-\mu^2/(2v)}.$$

To complete the argument, select $\mu = \sqrt{2v \log(1 + d)} + \tfrac{2}{3} L \log(1 + d)$ to reach

$$\begin{aligned}
\mathbb{E}\|Z\| &\leq \mu + 4d \sqrt{v} \, e^{-(2v \log(1 + d))/(2v)} + \tfrac{8}{3} d L \, e^{-3((2/3) L \log(1 + d))/(2L)} \\
&\leq \sqrt{2v \log(1 + d)} + \tfrac{2}{3} L \log(1 + d) + 4\sqrt{v} + \tfrac{8}{3} L.
\end{aligned}$$

The stated bound (7.3.3) follows after we combine terms and agglomerate constants.


□ 
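
The Gaussian tail estimate used in the proof above can be compared with the exact value of the integral, which has a closed form in terms of the complementary error function. This is a minimal sketch using only the standard library; the function names and the test values of $v$ and $\mu$ are illustrative assumptions.

```python
import math


def gaussian_tail(mu, v):
    """Exact value of the integral of exp(-t^2 / (2 v)) over [mu, infinity)."""
    return math.sqrt(math.pi * v / 2.0) * math.erfc(mu / math.sqrt(2.0 * v))


def tail_estimate(mu, v):
    """The estimate sqrt(v) * exp(-mu^2 / (2 v)), derived under the assumption mu >= sqrt(v)."""
    return math.sqrt(v) * math.exp(-mu * mu / (2.0 * v))


v = 1.0
for mu in (1.0, 1.5, 3.0):  # each value satisfies mu >= sqrt(v)
    print(mu, gaussian_tail(mu, v), tail_estimate(mu, v))
```

The estimate is within a small constant factor of the exact value at $\mu = \sqrt{v}$ and becomes relatively looser as $\mu$ grows, which is harmless here because the constants are agglomerated in the end.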


7.8 Notes 

At present, there are two different ways to improve the dimensional factor that appears in matrix 
concentration inequalities. 

First, there is a sequence of matrix concentration results where the dimensional parameter
is bounded by the maximum rank of the random matrix. The first bound of this type is due to
Rudelson [Rud99]. Oliveira’s results in [Oli10b] also exhibit this reduced dimensional
dependence. A subsequent paper [MZ11] by Magen & Zouzias contains a related argument that gives
similar results. We do not discuss this class of bounds here.

The idea that the dimensional factor should depend on metric properties of the random
matrix appears in a paper of Hsu, Kakade, & Zhang [HKZ12]. They obtain a bound that is similar
to Theorem 7.7.1. Unfortunately, their argument is complicated, and the results it delivers are
suboptimal.

Theorem 7.7.1 is essentially due to Stanislav Minsker [Min11]. His approach leads to somewhat
sharper bounds than the approach in the paper of Hsu, Kakade, & Zhang, and his method
is easier to understand.

These notes contain another approach to intrinsic dimension bounds. The intrinsic Chernoff 
bounds that emerge from our framework are new. The proof of the intrinsic Bernstein bound, 
Theorem 7.7.1, can be interpreted as a distillation of Minsker’s argument. Indeed, many of the 
specific calculations already appear in Minsker’s paper. We have obtained constants that are 
marginally better. 







A Proof of Lieb’s Theorem 


Our approach to random matrices depends on some sophisticated ideas that are not usually
presented in linear algebra courses. This chapter contains a complete derivation of the results that
undergird our matrix concentration inequalities. We begin with a short argument that explains 
how Lieb’s Theorem follows from deep facts about a function called the matrix relative entropy. 
The balance of the chapter is devoted to an analysis of the matrix relative entropy. Along the way, 
we establish the core properties of the trace exponential function and the matrix logarithm. This 
discussion may serve as an introduction to the advanced techniques of matrix analysis. 

8.1 Lieb’s Theorem 

In his 1973 paper on trace functions, Lieb established an important concavity theorem [Lie73, 
Thm. 6] for the trace exponential function. As we saw in Chapter 3, this result animates all of our 
matrix concentration inequalities. 

Theorem 8.1.1 (Lieb). Let $H$ be a fixed Hermitian matrix with dimension $d$. The map

$$A \mapsto \operatorname{tr} \exp(H + \log A) \tag{8.1.1}$$

is concave on the convex cone of $d \times d$ positive-definite Hermitian matrices.

Section 8.1 contains an overview of the proof of Theorem 8.1.1. First, we state the background 
material that we require, and then we show how the theorem follows. Some of the supporting 
results are major theorems in their own right, and the details of their proofs will consume the 
rest of the chapter. 

8.1.1 Conventions 

The symbol $\mathbb{R}_{++}$ refers to the set of positive real numbers. We remind the reader of our
convention that bold capital letters that are symmetric about the vertical axis (A, H, M, T, U, Ψ) always
refer to Hermitian matrices. We reserve the letter I for the identity matrix, while the letter Q
always refers to a unitary matrix. Other bold capital letters (B, K, L) denote rectangular matrices.

Unless stated otherwise, the results in this chapter hold for all matrices whose dimensions are
compatible. For example, any result that involves a sum A + H includes the implicit constraint
that the two matrices are the same size.

Throughout this chapter, we assume that the parameter $\tau \in [0, 1]$, and we use the shorthand
$\bar{\tau} = 1 - \tau$ to make formulas involving convex combinations more legible.

8.1.2 Matrix Relative Entropy 

The proof of Lieb’s Theorem depends on the properties of a bivariate function called the matrix 
relative entropy. 

Definition 8.1.2 (Matrix Relative Entropy). Let $A$ and $H$ be positive-definite matrices of the same
size. The entropy of $A$ relative to $H$ is

$$D(A; H) = \operatorname{tr}\left[ A (\log A - \log H) - (A - H) \right].$$

The relative entropy can be viewed as a measure of the difference between the matrix A and the 
matrix H, but it is not a metric. Related functions arise in quantum statistical mechanics and 
quantum information theory. 

We need two facts about the matrix relative entropy. 

Proposition 8.1.3 (Matrix Relative Entropy is Nonnegative). For positive-definite matrices $A$ and
$H$ of the same size, the matrix relative entropy $D(A; H) \geq 0$.

Proposition 8.1.3 is easy to prove; see Section 8.3.5 for the short argument. 

Theorem 8.1.4 (The Matrix Relative Entropy is Convex). The map $(A, H) \mapsto D(A; H)$ is convex.
That is, for positive-definite $A_i$ and $H_i$ of the same size,

$$D(\tau A_1 + \bar{\tau} A_2; \tau H_1 + \bar{\tau} H_2) \leq \tau \cdot D(A_1; H_1) + \bar{\tau} \cdot D(A_2; H_2) \quad\text{for $\tau \in [0, 1]$.}$$

Theorem 8.1.4 is one of the crown jewels of matrix analysis. The supporting material for this 
result occupies the bulk of this chapter; the argument culminates in Section 8.8. 

8.1.3 Partial Maximization 

We also require a basic fact from convex analysis which states that partial maximization of a 
concave function produces a concave function. We include the simple proof. 

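Fact 8.1.5 (Partial Maximization). Let $f$ be a concave function of two variables. Then the function
$y \mapsto \sup_x f(x; y)$ obtained by partial maximization is concave.

Proof. Fix $\varepsilon > 0$. For each pair of points $y_1$ and $y_2$, there are points $x_1$ and $x_2$ that satisfy

$$f(x_1; y_1) \geq \sup\nolimits_x f(x; y_1) - \varepsilon \quad\text{and}\quad f(x_2; y_2) \geq \sup\nolimits_x f(x; y_2) - \varepsilon.$$

For each $\tau \in [0, 1]$, the concavity of $f$ implies that

$$\begin{aligned}
\sup\nolimits_x f(x; \tau y_1 + \bar{\tau} y_2) &\geq f(\tau x_1 + \bar{\tau} x_2; \tau y_1 + \bar{\tau} y_2) \\
&\geq \tau \cdot f(x_1; y_1) + \bar{\tau} \cdot f(x_2; y_2) \\
&\geq \tau \cdot \sup\nolimits_x f(x; y_1) + \bar{\tau} \cdot \sup\nolimits_x f(x; y_2) - \varepsilon.
\end{aligned}$$

Take the limit as $\varepsilon \downarrow 0$ to see that the partial supremum is a concave function of $y$. □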
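
A toy example may help fix the idea. The sketch below uses the jointly concave function $f(x; y) = -(x - y/2)^2 - \tfrac{3}{4} y^2$, whose partial maximum over $x$ is $-\tfrac{3}{4} y^2$, and it approximates the supremum by a maximum over a grid. The function, the grid, and the test points are all hypothetical choices made so that the maximizer $x = y/2$ lies on the grid.

```python
def f(x, y):
    """A jointly concave function: f(x; y) = -(x - y/2)^2 - (3/4) y^2."""
    return -(x - y / 2.0) ** 2 - 0.75 * y * y


X_GRID = [-3.0 + 0.01 * k for k in range(601)]  # candidate maximizers x in [-3, 3]


def partial_max(y):
    """Approximate sup_x f(x; y) by a maximum over the grid."""
    return max(f(x, y) for x in X_GRID)


ys = [-4.0, -2.5, -1.0, 0.0, 0.5, 2.0, 4.0]
gaps = []
for y1 in ys:
    for y2 in ys:
        for tau in (0.25, 0.5, 0.75):
            mid = tau * y1 + (1.0 - tau) * y2
            chord = tau * partial_max(y1) + (1.0 - tau) * partial_max(y2)
            gaps.append(partial_max(mid) - chord)  # nonnegative for a concave function
print(min(gaps))
```

Every gap is nonnegative up to rounding, as concavity of the partial maximum requires.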


8.1.4 A Proof of Lieb’s Theorem

Taking the results about the matrix relative entropy for granted, it is not hard to prove Lieb’s 
Theorem. We begin with a variational representation of the trace, which restates the fact that 
matrix relative entropy is nonnegative. 

Lemma 8.1.6 (Variational Formula for Trace). Let $M$ be a positive-definite matrix. Then

$$\operatorname{tr} M = \sup_{T \succ 0} \operatorname{tr}\left[ T \log M - T \log T + T \right].$$

Proof. Proposition 8.1.3 states that $D(T; M) \geq 0$. Introduce the definition of the matrix relative
entropy, and rearrange to reach

$$\operatorname{tr} M \geq \operatorname{tr}\left[ T \log M - T \log T + T \right].$$

When $T = M$, both sides are equal, which yields the advertised identity. □
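
In the $1 \times 1$ case the variational formula reduces to a statement about positive numbers: for $m > 0$, the supremum of $t \log m - t \log t + t$ over $t > 0$ equals $m$ and is attained at $t = m$. The sketch below checks this scalar special case on a grid. The value of $m$ and the grid are arbitrary assumptions.

```python
import math


def objective(t, m):
    """The scalar objective t log m - t log t + t for positive t and m."""
    return t * math.log(m) - t * math.log(t) + t


m = 2.5
grid = [0.01 * k for k in range(1, 1001)]  # t in (0, 10]
best = max(grid, key=lambda t: objective(t, m))
print(best, objective(best, m))
```

The gap $m - (t \log m - t \log t + t)$ is exactly the scalar relative entropy of $t$ with respect to $m$, which is why the objective never exceeds $m$.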

To establish Lieb’s Theorem, we use the variational formula to represent the trace exponential.
Then we use the partial maximization result to condense the desired concavity property
from the convexity of the matrix relative entropy.

Proof of Theorem 8.1.1. In the variational formula, Lemma 8.1.6, select $M = \exp(H + \log A)$ to
obtain

$$\operatorname{tr} \exp(H + \log A) = \sup_{T \succ 0} \operatorname{tr}\left[ T (H + \log A) - T \log T + T \right].$$

The latter expression can be written compactly using the matrix relative entropy:

$$\operatorname{tr} \exp(H + \log A) = \sup_{T \succ 0} \left[ \operatorname{tr}(TH) + \operatorname{tr} A - D(T; A) \right]. \tag{8.1.2}$$

For each Hermitian matrix $H$, the bracket is a concave function of the pair $(T, A)$ because of
Theorem 8.1.4. We see that the right-hand side of (8.1.2) is the partial maximum of a concave
function, and Fact 8.1.5 ensures that this expression defines a concave function of $A$. This
observation establishes the theorem. □
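
For readers who want the intermediate algebra, the compact form of the bracket can be reached as follows. This is a worked step that uses only Definition 8.1.2 with $T$ in the first argument and $A$ in the second.

$$\begin{aligned}
\operatorname{tr}\left[ T (H + \log A) - T \log T + T \right]
&= \operatorname{tr}(TH) + \operatorname{tr} A + \operatorname{tr}\left[ T \log A - T \log T + T - A \right] \\
&= \operatorname{tr}(TH) + \operatorname{tr} A - \operatorname{tr}\left[ T (\log T - \log A) - (T - A) \right] \\
&= \operatorname{tr}(TH) + \operatorname{tr} A - D(T; A).
\end{aligned}$$

The first two terms are linear in the pair $(T, A)$, and the last term is the negation of a convex function, so the bracket is concave in $(T, A)$.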

8.2 Analysis of the Relative Entropy for Vectors 

Many deep theorems about matrices have analogies for vectors. This observation is valuable 
because we can usually adapt an analysis from the vector setting to establish the parallel result 
for matrices. In the matrix setting, however, it may be necessary to install a significant amount of 
extra machinery. If we keep the simpler structure of the vector argument in mind, we can avoid 
being crushed in the gears. 

8.2.1 The Relative Entropy for Vectors 

The goal of §8.2 is to introduce the relative entropy function for positive vectors and to derive
some key properties of this function. Later we will analyze the matrix relative entropy by
emulating these arguments.

Definition 8.2.1 (Relative Entropy). Let $a$ and $h$ be positive vectors of the same size. The entropy
of $a$ relative to $h$ is defined as

$$D(a; h) = \sum\nolimits_k \left[ a_k (\log a_k - \log h_k) - (a_k - h_k) \right].$$

A variant of the relative entropy arises in information theory and statistics as a measure of the
discrepancy between two probability distributions on a finite set. We will show that the relative
entropy is nonnegative and convex.

It may seem abusive to recycle the notation for the relative entropy on matrices. To justify 
this decision, we observe that 


$$D(a; h) = D(\operatorname{diag}(a); \operatorname{diag}(h))$$

where diag(-) maps a vector to a diagonal matrix in the natural way. In other words, the vector 
relative entropy is a special case of the matrix relative entropy. Ultimately, the vector case is 
easier to understand because diagonal matrices commute. 

8.2.2 Relative Entropy is Nonnegative 

As we have noted, the relative entropy measures the difference between two positive vectors. 
This interpretation is supported by the fact that the relative entropy is nonnegative. 

Proposition 8.2.2 (Relative Entropy is Nonnegative). For positive vectors $a$ and $h$ of the same size,
the relative entropy $D(a; h) \geq 0$.

Proof. Let $f : \mathbb{R}_{++} \to \mathbb{R}$ be a differentiable convex function on the positive real line. The function
$f$ lies above its tangent lines, so

$$f(a) \geq f(h) + f'(h) \cdot (a - h) \quad\text{for positive $a$ and $h$.}$$

Instantiate this result for the convex function $f(a) = a \log a - a$, and rearrange to obtain the
numerical inequality

$$a (\log a - \log h) - (a - h) \geq 0 \quad\text{for positive $a$ and $h$.}$$

Sum this expression over the components of the vectors a and h to complete the argument. □ 
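
The vector relative entropy is simple to compute directly from Definition 8.2.1. The sketch below evaluates it for a pair of hypothetical positive vectors and also illustrates that it vanishes when the arguments coincide and that it is not symmetric in its arguments.

```python
import math


def relative_entropy(a, h):
    """D(a; h) = sum_k [a_k (log a_k - log h_k) - (a_k - h_k)] for positive vectors."""
    return sum(ak * (math.log(ak) - math.log(hk)) - (ak - hk) for ak, hk in zip(a, h))


# Hypothetical positive vectors of the same size.
a = [0.2, 1.0, 3.0]
h = [0.5, 1.0, 2.0]
print(relative_entropy(a, h), relative_entropy(h, a), relative_entropy(a, a))
```

The lack of symmetry is one reason the relative entropy is a measure of discrepancy rather than a metric.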

Proposition 8.1.3 states that the matrix relative entropy satisfies the same nonnegativity
property as the vector relative entropy. The argument for matrices relies on the same ideas as
Proposition 8.2.2, and it is hardly more difficult. See §8.3.5 for the details.

8.2.3 The Perspective Transformation 

Our next goal is to prove that the relative entropy is a convex function. To establish this claim, 
we use an elegant technique from convex analysis. The approach depends on the perspective 
transformation, a method for constructing a bivariate convex function from a univariate convex 
function. 

Definition 8.2.3 (Perspective Transformation). Let $f : \mathbb{R}_{++} \to \mathbb{R}$ be a convex function on the
positive real line. The perspective $\psi_f$ of the function $f$ is defined as

$$\psi_f : \mathbb{R}_{++} \times \mathbb{R}_{++} \to \mathbb{R} \quad\text{where}\quad \psi_f(a; h) = a \cdot f(h/a).$$

The perspective transformation has an interesting geometric interpretation. If we trace the ray
from the origin $(0, 0, 0)$ in $\mathbb{R}^3$ through the point $(a, h, \psi_f(a; h))$, it pierces the plane $(1, \cdot, \cdot)$ at the
point $(1, h/a, f(h/a))$. Equivalently, for each positive $a$, the epigraph of $f$ is the “shadow” of the
epigraph of $\psi_f(a, \cdot)$ on the plane $(1, \cdot, \cdot)$.

The key fact is that the perspective of a convex function is convex. This point follows from 
the geometric reasoning in the last paragraph; we also include an analytic proof. 

Fact 8.2.4 (Perspectives are Convex). Let $f : \mathbb{R}_{++} \to \mathbb{R}$ be a convex function. Then the perspective
$\psi_f$ is convex. That is, for positive numbers $a_i$ and $h_i$,

$$\psi_f(\tau a_1 + \bar{\tau} a_2; \tau h_1 + \bar{\tau} h_2) \leq \tau \cdot \psi_f(a_1; h_1) + \bar{\tau} \cdot \psi_f(a_2; h_2) \quad\text{for $\tau \in [0, 1]$.}$$

Proof. Fix two pairs $(a_1, h_1)$ and $(a_2, h_2)$ of positive numbers and an interpolation parameter
$\tau \in [0, 1]$. Form the convex combinations

$$a = \tau a_1 + \bar{\tau} a_2 \quad\text{and}\quad h = \tau h_1 + \bar{\tau} h_2.$$

We need to bound the perspective $\psi_f(a; h)$ as the convex combination of its values at $\psi_f(a_1; h_1)$
and $\psi_f(a_2; h_2)$. The trick is to introduce another pair of interpolation parameters:

$$s = \frac{\tau a_1}{a} \quad\text{and}\quad \bar{s} = \frac{\bar{\tau} a_2}{a}.$$

By construction, $s \in [0, 1]$ and $\bar{s} = 1 - s$. We quickly determine that

$$\begin{aligned}
\psi_f(a; h) &= a \cdot f(h/a) \\
&= a \cdot f(\tau h_1/a + \bar{\tau} h_2/a) \\
&= a \cdot f(s \cdot h_1/a_1 + \bar{s} \cdot h_2/a_2) \\
&\leq a \left[ s \cdot f(h_1/a_1) + \bar{s} \cdot f(h_2/a_2) \right] \\
&= \tau \cdot a_1 \cdot f(h_1/a_1) + \bar{\tau} \cdot a_2 \cdot f(h_2/a_2) \\
&= \tau \cdot \psi_f(a_1; h_1) + \bar{\tau} \cdot \psi_f(a_2; h_2).
\end{aligned}$$

To obtain the second identity, we write $h$ as a convex combination. The third identity follows
from the definitions of $s$ and $\bar{s}$. The inequality depends on the fact that $f$ is convex. Afterward,
we invoke the definitions of $s$ and $\bar{s}$ again. We conclude that $\psi_f$ is convex. □
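
The convexity of a perspective can be probed numerically. The sketch below forms the perspective of the convex function $f(x) = x^2$, which is $h^2/a$, and evaluates the convexity gap over a few hypothetical pairs of points and interpolation parameters.

```python
def perspective(f, a, h):
    """Perspective psi_f(a; h) = a * f(h / a) for positive a and h."""
    return a * f(h / a)


def convexity_gap(f, p1, p2, tau):
    """tau * psi(p1) + (1 - tau) * psi(p2) - psi(tau * p1 + (1 - tau) * p2)."""
    (a1, h1), (a2, h2) = p1, p2
    a = tau * a1 + (1.0 - tau) * a2
    h = tau * h1 + (1.0 - tau) * h2
    mix = tau * perspective(f, a1, h1) + (1.0 - tau) * perspective(f, a2, h2)
    return mix - perspective(f, a, h)


def square(x):
    """A convex function on the positive real line; its perspective is h^2 / a."""
    return x * x


# Hypothetical pairs (a, h) of positive numbers.
points = [(0.5, 0.2), (1.0, 3.0), (2.0, 1.0), (4.0, 0.1)]
gaps = [convexity_gap(square, p, q, tau)
        for p in points for q in points for tau in (0.1, 0.5, 0.9)]
print(min(gaps))
```

A nonnegative gap for every pair is what Fact 8.2.4 predicts; the smallest gaps occur when the two points coincide.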

When we study standard matrix functions, it is sometimes necessary to replace a convexity 
assumption by a stricter property called operator convexity. There is a remarkable extension of 
the perspective transform that constructs a bivariate matrix function from an operator convex 
function. The matrix perspective has a powerful convexity property analogous with the result in 
Fact 8.2.4. The analysis of the matrix perspective depends on a far-reaching generalization of the 
Jensen inequality for operator convex functions. We develop these ideas in §§8.4.5, 8.5, and 8.6. 

8.2.4 The Relative Entropy is Convex 

To establish that the relative entropy is convex, we simply need to represent it as the perspective
of a convex function.

Proposition 8.2.5 (Relative Entropy is Convex). The map $(a, h) \mapsto D(a; h)$ is convex. That is, for
positive vectors $a_i$ and $h_i$ of the same size,

$$D(\tau a_1 + \bar{\tau} a_2; \tau h_1 + \bar{\tau} h_2) \leq \tau \cdot D(a_1; h_1) + \bar{\tau} \cdot D(a_2; h_2) \quad\text{for $\tau \in [0, 1]$.}$$

Proof. Consider the convex function $f(a) = a - 1 - \log a$, defined on the positive real line. By
direct calculation, the perspective transformation satisfies

$$\psi_f(a; h) = a (\log a - \log h) - (a - h) \quad\text{for positive $a$ and $h$.}$$

Fact 8.2.4 states that $\psi_f$ is a convex function. For positive vectors $a$ and $h$, we can express the
relative entropy as

$$D(a; h) = \sum\nolimits_k \psi_f(a_k; h_k).$$

It follows that the relative entropy is convex. □ 
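
As an informal numerical sanity check of Proposition 8.2.5, the following NumPy sketch samples random positive vectors and confirms that the convexity gap is never negative beyond rounding error. The function names, the sampling range, and the tolerance are illustrative assumptions, not part of the argument above.

```python
import numpy as np

def relative_entropy(a, h):
    """Vector relative entropy D(a; h) = sum_k [a_k (log a_k - log h_k) - (a_k - h_k)]."""
    a, h = np.asarray(a, dtype=float), np.asarray(h, dtype=float)
    return float(np.sum(a * (np.log(a) - np.log(h)) - (a - h)))

def convexity_gap(a1, h1, a2, h2, tau):
    """Right-hand side minus left-hand side of the convexity inequality."""
    lhs = relative_entropy(tau * a1 + (1 - tau) * a2, tau * h1 + (1 - tau) * h2)
    rhs = tau * relative_entropy(a1, h1) + (1 - tau) * relative_entropy(a2, h2)
    return rhs - lhs

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    # Four random positive vectors of length 6 (arbitrary illustrative choice).
    a1, h1, a2, h2 = rng.uniform(0.1, 5.0, size=(4, 6))
    gaps.append(convexity_gap(a1, h1, a2, h2, rng.uniform()))

print(min(gaps) >= -1e-10)
```

A random search of this kind cannot prove convexity, but a single negative gap would expose an error in the formula for the perspective.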

Similarly, we can express the matrix relative entropy using the matrix perspective transformation.
The analysis for matrices is substantially more involved. But, as we will see in §8.8, the
argument ultimately follows the same pattern as the proof of Proposition 8.2.5.

8.3 Elementary Trace Inequalities 

It is time to begin our investigation into the properties of matrix functions. This section contains 
some simple inequalities for the trace of a matrix function that we can establish by manipulating 
eigenvalues and eigenvalue decompositions. These techniques are adequate to explain why the 
matrix relative entropy is nonnegative. In contrast, we will need more subtle arguments to study 
the convexity properties of the matrix relative entropy. 

8.3.1 Trace Functions 

We can construct a real-valued function on Hermitian matrices by composing the trace with a 
standard matrix function. This type of map is called a trace function. 

Definition 8.3.1 (Trace function). Let $f : I \to \mathbb{R}$ be a function on an interval $I$ of the real line, and
let $A$ be an Hermitian matrix whose eigenvalues are contained in $I$. We define the trace function
$\operatorname{tr} f$ by the rule

\[
\operatorname{tr} f(A) = \sum_i f(\lambda_i(A)),
\]

where $\lambda_i(A)$ denotes the $i$th largest eigenvalue of $A$. This formula gives the same result as composing
the trace with the standard matrix function $f$.

Our first goal is to demonstrate that a trace function $\operatorname{tr} f$ inherits a monotonicity property from
the underlying scalar function $f$.

8.3.2 Monotone Trace Functions 

Let us demonstrate that a weakly increasing scalar function induces a trace function
that preserves the semidefinite order. To that end, recall that the relation $A \preccurlyeq H$ implies that
each eigenvalue of $A$ is dominated by the corresponding eigenvalue of $H$.

Fact 8.3.2 (Semidefinite Order implies Eigenvalue Order). For Hermitian matrices $A$ and $H$,

\[
A \preccurlyeq H \quad\text{implies}\quad \lambda_i(A) \leq \lambda_i(H) \quad\text{for each index } i.
\]

Proof. This result follows instantly from the Courant-Fischer Theorem:

\[
\lambda_i(A) = \max_{\dim L = i} \, \min_{u \in L} \frac{u^* A u}{u^* u}
\leq \max_{\dim L = i} \, \min_{u \in L} \frac{u^* H u}{u^* u} = \lambda_i(H).
\]

The maximum ranges over all $i$-dimensional linear subspaces $L$ in the domain of $A$, and we use
the convention that $0/0 = 0$. The inequality follows from the definition (2.1.11) of the semidefinite
order $\preccurlyeq$. □

With this fact at hand, the claim follows quickly. 

Proposition 8.3.3 (Monotone Trace Functions). Let $f : I \to \mathbb{R}$ be a weakly increasing function
on an interval $I$ of the real line, and let $A$ and $H$ be Hermitian matrices whose eigenvalues are
contained in $I$. Then

\[
A \preccurlyeq H \quad\text{implies}\quad \operatorname{tr} f(A) \leq \operatorname{tr} f(H).
\]

Proof. In view of Fact 8.3.2,

\[
\operatorname{tr} f(A) = \sum_i f(\lambda_i(A)) \leq \sum_i f(\lambda_i(H)) = \operatorname{tr} f(H).
\]

The inequality depends on the assumption that $f$ is weakly increasing. □

Our approach to matrix concentration relies on a special case of Proposition 8.3.3. 

Example 8.3.4 (Trace Exponential is Monotone). The trace exponential map is monotone:

\[
A \preccurlyeq H \quad\text{implies}\quad \operatorname{tr} e^{A} \leq \operatorname{tr} e^{H}
\]

for all Hermitian matrices $A$ and $H$.
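
To make Fact 8.3.2 and Example 8.3.4 concrete, here is a small NumPy sketch. It builds a pair $A \preccurlyeq H$ by adding a positive-semidefinite matrix to a random Hermitian matrix, then compares the ordered eigenvalues and the trace exponentials. The helper names and the way the pair is generated are assumptions made for illustration.

```python
import numpy as np

def random_hermitian(d, rng):
    """Random d x d Hermitian matrix (illustrative helper)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (g + g.conj().T) / 2

def trace_function(f, a):
    """tr f(A) = sum_i f(lambda_i(A)) for a Hermitian matrix A."""
    return float(np.sum(f(np.linalg.eigvalsh(a))))

rng = np.random.default_rng(1)
d = 5
a = random_hermitian(d, rng)
g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
h = a + g @ g.conj().T   # H - A is positive semidefinite, so A precedes H

# eigvalsh sorts in ascending order, so entry i of each array has the same rank.
eig_ok = bool(np.all(np.linalg.eigvalsh(a) <= np.linalg.eigvalsh(h) + 1e-12))
trace_ok = trace_function(np.exp, a) <= trace_function(np.exp, h)
print(eig_ok, trace_ok)
```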


8.3.3 Eigenvalue Decompositions, Redux 

Before we continue, let us introduce a style for writing eigenvalue decompositions that will make 
the next argument more transparent. Each $d \times d$ Hermitian matrix $A$ can be expressed as

\[
A = \sum_{i=1}^{d} \lambda_i u_i u_i^*.
\]

The eigenvalues $\lambda_1 \geq \dots \geq \lambda_d$ of $A$ are real numbers, listed in weakly decreasing order. The family
$\{u_1, \dots, u_d\}$ of eigenvectors of $A$ forms an orthonormal basis for $\mathbb{C}^d$ with respect to the standard
inner product.

8.3.4 A Trace Inequality for Bivariate Functions

In general, it is challenging to study functions of two or more matrices because the eigenvectors 
can interact in complicated ways. Nevertheless, there is one type of relation that always transfers 
from the scalar setting to the matrix setting. 

Proposition 8.3.5 (Generalized Klein Inequality). Let $f_i : I \to \mathbb{R}$ and $g_i : I \to \mathbb{R}$ be functions on an
interval $I$ of the real line, and suppose that

\[
\sum_i f_i(a) \, g_i(h) \geq 0 \quad\text{for all } a, h \in I.
\]

If $A$ and $H$ are Hermitian matrices whose eigenvalues are contained in $I$, then

\[
\sum_i \operatorname{tr} [ f_i(A) \, g_i(H) ] \geq 0.
\]

Proof. Consider eigenvalue decompositions $A = \sum_j \lambda_j u_j u_j^*$ and $H = \sum_k \mu_k v_k v_k^*$. Then

\[
\begin{aligned}
\sum_i \operatorname{tr} [ f_i(A) \, g_i(H) ]
&= \sum_i \operatorname{tr} \Big[ \Big( \sum_j f_i(\lambda_j) u_j u_j^* \Big) \Big( \sum_k g_i(\mu_k) v_k v_k^* \Big) \Big] \\
&= \sum_{j,k} \Big[ \sum_i f_i(\lambda_j) \, g_i(\mu_k) \Big] \cdot |\langle u_j, v_k \rangle|^2 \geq 0.
\end{aligned}
\]

We use the definition of a standard matrix function, we apply linearity of the trace to reorder
the sums, and we identify the trace as a squared inner product. The inequality follows from our
assumption on the scalar functions. □

8.3.5 The Matrix Relative Entropy is Nonnegative 

Using the generalized Klein inequality, it is easy to prove Proposition 8.1.3, which states that the 
matrix relative entropy is nonnegative. The argument echoes the analysis in Proposition 8.2.2 for 
the vector case. 

Proof of Proposition 8.1.3. Suppose that $f : \mathbb{R}_{++} \to \mathbb{R}$ is a differentiable, convex function on the
positive real line. Since $f$ is convex, the graph of $f$ lies above its tangents:

\[
f(a) \geq f(h) + f'(h)(a - h) \quad\text{for positive $a$ and $h$.}
\]

Using the generalized Klein inequality, Proposition 8.3.5, we can lift this relation to matrices:

\[
\operatorname{tr} f(A) \geq \operatorname{tr} [ f(H) + f'(H)(A - H) ] \quad\text{for all positive-definite $A$ and $H$.}
\]

This formula is sometimes called the (ungeneralized) Klein inequality.

Instantiate the latter result for the function $f(a) = a \log a - a$, and rearrange to see that

\[
D(A; H) = \operatorname{tr} [ A (\log A - \log H) - (A - H) ] \geq 0 \quad\text{for all positive-definite $A$ and $H$.}
\]

In other words, the matrix relative entropy is nonnegative. □
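
The nonnegativity claim is easy to probe numerically. The sketch below computes $D(A; H)$ from eigenvalue decompositions for random positive-definite pairs. The helper names, the dimension, and the shift used to keep the matrices well conditioned are assumptions chosen for illustration.

```python
import numpy as np

def random_pd(d, rng):
    """Random d x d Hermitian positive-definite matrix (illustrative helper)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return g @ g.conj().T + 0.1 * np.eye(d)

def logm_pd(a):
    """Matrix logarithm of a positive-definite matrix via its eigenvalue decomposition."""
    lam, q = np.linalg.eigh(a)
    return (q * np.log(lam)) @ q.conj().T

def matrix_relative_entropy(a, h):
    """D(A; H) = tr[A (log A - log H) - (A - H)]."""
    return float(np.real(np.trace(a @ (logm_pd(a) - logm_pd(h)) - (a - h))))

rng = np.random.default_rng(2)
values = [matrix_relative_entropy(random_pd(4, rng), random_pd(4, rng))
          for _ in range(200)]
a = random_pd(4, rng)
print(min(values) >= -1e-9, abs(matrix_relative_entropy(a, a)) < 1e-9)
```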

8.4 The Logarithm of a Matrix 

In this section, we commence our journey toward the proof that the matrix relative entropy is 
convex. The proof of Proposition 8.2.5 indicates that the convexity of the logarithm plays an 
important role in the convexity of the vector relative entropy. As a first step, we will demonstrate 
that the matrix logarithm has a striking convexity property with respect to the semidefinite order.
Along the way, we will also develop a monotonicity property of the matrix logarithm. 
8.4.1 An Integral Representation of the Logarithm

Initially, we defined the logarithm of a $d \times d$ positive-definite matrix $A$ using an eigenvalue
decomposition:

\[
\log A = Q \begin{bmatrix} \log \lambda_1 & & \\ & \ddots & \\ & & \log \lambda_d \end{bmatrix} Q^*
\quad\text{where}\quad
A = Q \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{bmatrix} Q^*.
\]

To study how the matrix logarithm interacts with the semidefinite order, we will work with an
alternative presentation based on an integral formula.


Proposition 8.4.1 (Integral Representation of the Logarithm). The logarithm of a positive number
$a$ is given by the integral

\[
\log a = \int_0^\infty \left[ \frac{1}{1+u} - \frac{1}{a+u} \right] \mathrm{d}u.
\]

Similarly, the logarithm of a positive-definite matrix $A$ is given by the integral

\[
\log A = \int_0^\infty \left[ (1+u)^{-1} I - (A + uI)^{-1} \right] \mathrm{d}u.
\]


Proof. To verify the scalar formula, we simply use the definition of the improper integral:

\[
\begin{aligned}
\int_0^\infty \left[ \frac{1}{1+u} - \frac{1}{a+u} \right] \mathrm{d}u
&= \lim_{L \to \infty} \int_0^L \left[ \frac{1}{1+u} - \frac{1}{a+u} \right] \mathrm{d}u \\
&= \lim_{L \to \infty} \big[ \log(1+u) - \log(a+u) \big]_{u=0}^{L} \\
&= \log a + \lim_{L \to \infty} \log \left( \frac{1+L}{a+L} \right) = \log a.
\end{aligned}
\]

We obtain the matrix formula by applying the scalar formula to each eigenvalue of $A$ and then
expressing the result in terms of the original matrix. □
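
One way to test the matrix formula numerically is sketched below. As an assumption made purely for computation, we substitute $u = x/(1-x)$, which maps the half-line onto $[0, 1)$ and turns the integrand into $(A - I)\,((1-x)A + xI)^{-1}$, and then apply Gauss-Legendre quadrature. The function names, the number of nodes, and the test spectrum are illustrative choices.

```python
import numpy as np

def logm_eig(a):
    """Matrix logarithm via the eigenvalue decomposition."""
    lam, q = np.linalg.eigh(a)
    return (q * np.log(lam)) @ q.conj().T

def logm_integral(a, n_nodes=60):
    """Approximate log A from the integral formula after substituting u = x / (1 - x)."""
    eye = np.eye(a.shape[0])
    t, w = np.polynomial.legendre.leggauss(n_nodes)   # nodes and weights on [-1, 1]
    x, w = (t + 1.0) / 2.0, w / 2.0                   # rescaled to [0, 1]
    total = np.zeros(a.shape, dtype=complex)
    for xi, wi in zip(x, w):
        # Transformed integrand: (A - I) ((1 - x) A + x I)^{-1}
        total += wi * (a - eye) @ np.linalg.inv((1.0 - xi) * a + xi * eye)
    return total

rng = np.random.default_rng(3)
q, _ = np.linalg.qr(rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)))
a = (q * np.linspace(0.5, 3.0, 4)) @ q.conj().T      # eigenvalues 0.5, ..., 3.0
a = (a + a.conj().T) / 2                              # remove rounding asymmetry
error = float(np.max(np.abs(logm_integral(a) - logm_eig(a))))
print(error < 1e-10)
```

The quadrature converges quickly here because the transformed integrand is smooth on $[0, 1]$; a matrix with a much wider spectrum may need more nodes.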

The integral formula from Proposition 8.4.1 is powerful because it expresses the logarithm 
in terms of the matrix inverse, which is much easier to analyze. Although it may seem that we 
have pulled this representation from thin air, the approach is motivated by a wonderful theory 
of matrix functions initiated by Löwner in the 1930s.


8.4.2 Operator Monotone Functions 

Our next goal is to study the monotonicity properties of the matrix logarithm. To frame this 
discussion properly, we need to introduce an abstract definition. 

Definition 8.4.2 (Operator Monotone Function). Let $f : I \to \mathbb{R}$ be a function on an interval $I$ of
the real line. The function $f$ is operator monotone on $I$ when

\[
A \preccurlyeq H \quad\text{implies}\quad f(A) \preccurlyeq f(H)
\]

for all Hermitian matrices $A$ and $H$ whose eigenvalues are contained in $I$.

Let us state some basic facts about operator monotone functions. Many of these points follow
easily from the definition.

• When $\beta \geq 0$, the weakly increasing affine function $t \mapsto \alpha + \beta t$ is operator monotone on each
interval $I$ of the real line.

• The quadratic function $t \mapsto t^2$ is not operator monotone on the positive real line.

• The exponential map $t \mapsto e^t$ is not operator monotone on the real line.

• When $\alpha \geq 0$ and $f$ is operator monotone on $I$, the function $\alpha f$ is operator monotone on $I$.

• If $f$ and $g$ are operator monotone on an interval $I$, then $f + g$ is operator monotone on $I$.

These properties imply that the operator monotone functions form a convex cone. The list also warns
us that the class of operator monotone functions is somewhat smaller than the class of weakly
increasing functions.
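
The claim about the quadratic function can be seen on a $2 \times 2$ example. The matrices below are our own illustrative choice, not taken from the text: $H - A$ is positive semidefinite, yet $H^2 - A^2$ has a negative eigenvalue, while the logarithm (which is operator monotone) preserves the order for the same pair.

```python
import numpy as np

def apply_fn(f, a):
    """Standard matrix function f(A) for a Hermitian matrix A."""
    lam, q = np.linalg.eigh(a)
    return (q * f(lam)) @ q.conj().T

def min_eig(m):
    """Smallest eigenvalue of a Hermitian matrix."""
    return float(np.linalg.eigvalsh(m)[0])

# Both matrices are positive definite, and H - A = diag(1, 0) is psd.
A = np.array([[1.1, 1.0], [1.0, 1.1]])
H = np.array([[2.1, 1.0], [1.0, 1.1]])

order_gap = min_eig(H - A)               # zero: A precedes H
square_gap = min_eig(H @ H - A @ A)      # negative: squaring breaks the order
log_gap = min_eig(apply_fn(np.log, H) - apply_fn(np.log, A))  # nonnegative

print(order_gap >= -1e-12, square_gap < 0, log_gap >= -1e-10)
```

Here $H^2 - A^2$ has entries $3.2$, $1$, $1$, $0$, so its determinant is $-1$ and it cannot be positive semidefinite.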

8.4.3 The Negative Inverse is Operator Monotone 

Fortunately, interesting operator monotone functions do exist. Let us present an important example
related to the matrix inverse.

Proposition 8.4.3 (Negative Inverse is Operator Monotone). For each number $u \geq 0$, the function
$a \mapsto -(a + u)^{-1}$ is operator monotone on the positive real line. That is, for positive-definite matrices
$A$ and $H$,

\[
A \preccurlyeq H \quad\text{implies}\quad -(A + uI)^{-1} \preccurlyeq -(H + uI)^{-1}.
\]

Proof. Define the matrices $A_u = A + uI$ and $H_u = H + uI$. The semidefinite relation $A \preccurlyeq H$ implies
that $A_u \preccurlyeq H_u$. Apply the Conjugation Rule (2.1.12) to see that

\[
0 \prec H_u^{-1/2} A_u H_u^{-1/2} \preccurlyeq I.
\]

When a positive-definite matrix has eigenvalues bounded above by one, its inverse has eigenvalues
bounded below by one. Therefore,

\[
I \preccurlyeq (H_u^{-1/2} A_u H_u^{-1/2})^{-1} = H_u^{1/2} A_u^{-1} H_u^{1/2}.
\]

Another application of the Conjugation Rule (2.1.12) delivers the inequality $H_u^{-1} \preccurlyeq A_u^{-1}$. Finally,
we negate this semidefinite relation, which reverses its direction. □

8.4.4 The Logarithm is Operator Monotone 

Now, we are prepared to demonstrate that the logarithm is an operator monotone function. The 
argument combines the integral representation from Proposition 8.4.1 with the monotonicity of 
the inverse map from Proposition 8.4.3. 

Proposition 8.4.4 (Logarithm is Operator Monotone). The logarithm is an operator monotone
function on the positive real line. That is, for positive-definite matrices $A$ and $H$,

\[
A \preccurlyeq H \quad\text{implies}\quad \log A \preccurlyeq \log H.
\]

Proof. For each $u \geq 0$, Proposition 8.4.3 demonstrates that

\[
(1+u)^{-1} I - (A + uI)^{-1} \preccurlyeq (1+u)^{-1} I - (H + uI)^{-1}.
\]

The integral representation of the logarithm, Proposition 8.4.1, allows us to calculate that

\[
\log A = \int_0^\infty \left[ (1+u)^{-1} I - (A + uI)^{-1} \right] \mathrm{d}u
\preccurlyeq \int_0^\infty \left[ (1+u)^{-1} I - (H + uI)^{-1} \right] \mathrm{d}u = \log H.
\]

We have used the fact that the semidefinite order is preserved by integration against a positive
measure. □

8.4.5 Operator Convex Functions 

Next, let us investigate the convexity properties of the matrix logarithm. As before, we start with 
an abstract definition. 

Definition 8.4.5 (Operator Convex Function). Let $f : I \to \mathbb{R}$ be a function on an interval $I$ of the
real line. The function $f$ is operator convex on $I$ when

\[
f(\tau A + \bar{\tau} H) \preccurlyeq \tau \cdot f(A) + \bar{\tau} \cdot f(H) \quad\text{for all } \tau \in [0,1]
\]

and for all Hermitian matrices $A$ and $H$ whose eigenvalues are contained in $I$. A function $g : I \to \mathbb{R}$
is operator concave when $-g$ is operator convex on $I$.

We continue with some important facts about operator convex functions. Most of these
claims can be derived easily.

• When $\gamma \geq 0$, the quadratic function $t \mapsto \alpha + \beta t + \gamma t^2$ is operator convex on the real line.

• The exponential map $t \mapsto e^t$ is not operator convex on the real line.

• When $\alpha \geq 0$ and $f$ is operator convex on $I$, the function $\alpha f$ is operator convex on $I$.

• If $f$ and $g$ are operator convex on $I$, then $f + g$ is operator convex on $I$.

These properties imply that the operator convex functions form a convex cone. We also learn that
the family of operator convex functions is somewhat smaller than the family of convex functions.
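
For the quadratic function, operator convexity can be read off from an exact identity: $\tau A^2 + \bar{\tau} H^2 - (\tau A + \bar{\tau} H)^2 = \tau \bar{\tau} (A - H)^2$, and the right-hand side is positive semidefinite. The short NumPy sketch below checks this identity on random Hermitian matrices. The helper name and the parameter value are illustrative assumptions.

```python
import numpy as np

def random_hermitian(d, rng):
    """Random d x d Hermitian matrix (illustrative helper)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (g + g.conj().T) / 2

rng = np.random.default_rng(4)
a, h, tau = random_hermitian(4, rng), random_hermitian(4, rng), 0.3
mix = tau * a + (1 - tau) * h

# Convexity gap of t -> t^2 in the semidefinite order.
gap = tau * a @ a + (1 - tau) * h @ h - mix @ mix

identity_ok = np.allclose(gap, tau * (1 - tau) * (a - h) @ (a - h))
psd_ok = bool(np.linalg.eigvalsh((gap + gap.conj().T) / 2)[0] >= -1e-10)
print(identity_ok, psd_ok)
```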

8.4.6 The Inverse is Operator Convex 

The inverse provides a very important example of an operator convex function. 

Proposition 8.4.6 (Inverse is Operator Convex). For each $u \geq 0$, the function $a \mapsto (a + u)^{-1}$ is
operator convex on the positive real line. That is, for positive-definite matrices $A$ and $H$,

\[
(\tau A + \bar{\tau} H + uI)^{-1} \preccurlyeq \tau \cdot (A + uI)^{-1} + \bar{\tau} \cdot (H + uI)^{-1} \quad\text{for } \tau \in [0,1].
\]

To establish Proposition 8.4.6, we use an argument based on the Schur complement lemma. 
For completeness, let us state and prove this important fact. 
Fact 8.4.7 (Schur Complements). Suppose that $T$ is a positive-definite matrix. Then

\[
0 \preccurlyeq \begin{bmatrix} T & B \\ B^* & M \end{bmatrix}
\quad\text{if and only if}\quad
B^* T^{-1} B \preccurlyeq M. \tag{8.4.1}
\]

Proof of Fact 8.4.7. To see why this is true, just calculate that

\[
\begin{bmatrix} I & 0 \\ -B^* T^{-1} & I \end{bmatrix}
\begin{bmatrix} T & B \\ B^* & M \end{bmatrix}
\begin{bmatrix} I & -T^{-1} B \\ 0 & I \end{bmatrix}
= \begin{bmatrix} T & 0 \\ 0 & M - B^* T^{-1} B \end{bmatrix}.
\]

In essence, we are performing block Gaussian elimination to bring the original matrix into
block-diagonal form. Now, the Conjugation Rule (2.1.12) ensures that the central matrix on the left is
positive semidefinite together with the matrix on the right. From this equivalence, we extract the
result (8.4.1). □
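
The block elimination in this proof can be replayed numerically. The sketch below, which uses real symmetric matrices and helper names of our own choosing, verifies the factorization and then checks both directions of the equivalence (8.4.1) on one example where $B^* T^{-1} B \preccurlyeq M$ holds and one where it fails.

```python
import numpy as np

def is_psd(m, tol=1e-10):
    """Numerical test for positive semidefiniteness of a real symmetric matrix."""
    return bool(np.linalg.eigvalsh(m)[0] >= -tol)

rng = np.random.default_rng(5)
d = 3
g = rng.standard_normal((d, d))
T = g @ g.T + np.eye(d)                  # positive definite
B = rng.standard_normal((d, d))
Tinv = np.linalg.inv(T)
S = B.T @ Tinv @ B                       # the matrix B* T^{-1} B
I, Z = np.eye(d), np.zeros((d, d))

def block_matrix(M):
    """Assemble the block matrix [[T, B], [B*, M]]."""
    return np.block([[T, B], [B.T, M]])

M = S + np.eye(d)
left = np.block([[I, Z], [-B.T @ Tinv, I]])
right = np.block([[I, -Tinv @ B], [Z, I]])
factor_ok = np.allclose(left @ block_matrix(M) @ right,
                        np.block([[T, Z], [Z, M - S]]))

psd_case = is_psd(block_matrix(S + np.eye(d)))       # Schur condition holds
non_psd_case = is_psd(block_matrix(S - np.eye(d)))   # Schur condition fails
print(factor_ok, psd_case, non_psd_case)
```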


We continue with the proof that the inverse is operator convex. 

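Proof of Proposition 8.4.6. The Schur complement lemma, Fact 8.4.7, provides that

\[
0 \preccurlyeq \begin{bmatrix} T & I \\ I & T^{-1} \end{bmatrix}
\quad\text{whenever $T$ is positive definite.}
\]

Applying this observation to the positive-definite matrices $A + uI$ and $H + uI$, we see that

\[
\begin{aligned}
0 &\preccurlyeq \tau \cdot \begin{bmatrix} A + uI & I \\ I & (A + uI)^{-1} \end{bmatrix}
+ \bar{\tau} \cdot \begin{bmatrix} H + uI & I \\ I & (H + uI)^{-1} \end{bmatrix} \\
&= \begin{bmatrix} \tau A + \bar{\tau} H + uI & I \\ I & \tau \cdot (A + uI)^{-1} + \bar{\tau} \cdot (H + uI)^{-1} \end{bmatrix}.
\end{aligned}
\]

Since the top-left block of the latter matrix is positive definite, another application of Fact 8.4.7
delivers the relation

\[
(\tau A + \bar{\tau} H + uI)^{-1} \preccurlyeq \tau \cdot (A + uI)^{-1} + \bar{\tau} \cdot (H + uI)^{-1}.
\]

This is the advertised conclusion. □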

8.4.7 The Logarithm is Operator Concave 

We are finally prepared to verify that the logarithm is operator concave. The argument is based
on the integral representation from Proposition 8.4.1 and the convexity of the inverse map from
Proposition 8.4.6.

Proposition 8.4.8 (Logarithm is Operator Concave). The logarithm is operator concave on the
positive real line. That is, for positive-definite matrices $A$ and $H$,

\[
\tau \cdot \log A + \bar{\tau} \cdot \log H \preccurlyeq \log(\tau A + \bar{\tau} H) \quad\text{for } \tau \in [0,1].
\]

Proof. For each $u \geq 0$, Proposition 8.4.6 demonstrates that

\[
-\tau \cdot (A + uI)^{-1} - \bar{\tau} \cdot (H + uI)^{-1} \preccurlyeq -(\tau A + \bar{\tau} H + uI)^{-1}.
\]

Invoke the integral representation of the logarithm from Proposition 8.4.1 to see that

\[
\begin{aligned}
\tau \cdot \log A + \bar{\tau} \cdot \log H
&= \tau \cdot \int_0^\infty \left[ (1+u)^{-1} I - (A + uI)^{-1} \right] \mathrm{d}u
+ \bar{\tau} \cdot \int_0^\infty \left[ (1+u)^{-1} I - (H + uI)^{-1} \right] \mathrm{d}u \\
&= \int_0^\infty \left[ (1+u)^{-1} I - \left( \tau \cdot (A + uI)^{-1} + \bar{\tau} \cdot (H + uI)^{-1} \right) \right] \mathrm{d}u \\
&\preccurlyeq \int_0^\infty \left[ (1+u)^{-1} I - (\tau A + \bar{\tau} H + uI)^{-1} \right] \mathrm{d}u = \log(\tau A + \bar{\tau} H).
\end{aligned}
\]

Once again, we have used the fact that integration preserves the semidefinite order. □
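
Propositions 8.4.6 and 8.4.8 can both be spot-checked on random positive-definite matrices. The following sketch records the smallest eigenvalue of each semidefinite gap; the helper names, the value of $u$, and the tolerance are assumptions made for illustration.

```python
import numpy as np

def random_pd(d, rng):
    """Random d x d Hermitian positive-definite matrix (illustrative helper)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return g @ g.conj().T + 0.1 * np.eye(d)

def fn_pd(f, a):
    """Standard matrix function f(A) via the eigenvalue decomposition."""
    lam, q = np.linalg.eigh(a)
    return (q * f(lam)) @ q.conj().T

def min_eig(m):
    """Smallest eigenvalue after symmetrizing away rounding error."""
    return float(np.linalg.eigvalsh((m + m.conj().T) / 2)[0])

rng = np.random.default_rng(6)
d, u = 4, 0.7
eye = np.eye(d)
inverse_gaps, log_gaps = [], []
for _ in range(200):
    a, h, tau = random_pd(d, rng), random_pd(d, rng), rng.uniform()
    mix = tau * a + (1 - tau) * h
    # Inverse is operator convex: weighted inverses dominate the inverse of the mix.
    inverse_gaps.append(min_eig(tau * np.linalg.inv(a + u * eye)
                                + (1 - tau) * np.linalg.inv(h + u * eye)
                                - np.linalg.inv(mix + u * eye)))
    # Logarithm is operator concave: log of the mix dominates the weighted logs.
    log_gaps.append(min_eig(fn_pd(np.log, mix)
                            - tau * fn_pd(np.log, a) - (1 - tau) * fn_pd(np.log, h)))

print(min(inverse_gaps) >= -1e-9, min(log_gaps) >= -1e-9)
```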


8.5 The Operator Jensen Inequality 

Convexity is a statement about how a function interacts with averages. By definition, a function
$f : I \to \mathbb{R}$ is convex when

\[
f(\tau a + \bar{\tau} h) \leq \tau \cdot f(a) + \bar{\tau} \cdot f(h) \quad\text{for all } \tau \in [0,1] \text{ and all } a, h \in I. \tag{8.5.1}
\]

The convexity inequality (8.5.1) automatically extends from an average involving two terms to
an arbitrary average. This is the content of Jensen’s inequality.

Definition 8.4.5, of an operator convex function $f : I \to \mathbb{R}$, is similar in spirit:

\[
f(\tau A + \bar{\tau} H) \preccurlyeq \tau \cdot f(A) + \bar{\tau} \cdot f(H) \quad\text{for all } \tau \in [0,1] \tag{8.5.2}
\]

and all Hermitian matrices $A$ and $H$ whose eigenvalues are contained in $I$. Surprisingly, the
semidefinite relation (8.5.2) automatically extends to a large family of matrix averaging operations.
This remarkable property is called the operator Jensen inequality.

8.5.1 Matrix Convex Combinations 

In a vector space, convex combinations provide a natural method of averaging. But matrices 
have a richer structure, so we can consider a more general class of averages. 

Definition 8.5.1 (Matrix Convex Combination). Let $A_1$ and $A_2$ be Hermitian matrices. Consider
a decomposition of the identity of the form

\[
K_1^* K_1 + K_2^* K_2 = I.
\]

Then the Hermitian matrix

\[
K_1^* A_1 K_1 + K_2^* A_2 K_2 \tag{8.5.3}
\]

is called a matrix convex combination of $A_1$ and $A_2$.

To see why it is reasonable to call (8.5.3) an averaging operation on Hermitian matrices, let
us note a few of its properties.

• Definition 8.5.1 encompasses scalar convex combinations because we can take $K_1 = \tau^{1/2} I$
and $K_2 = \bar{\tau}^{1/2} I$.

• The matrix convex combination preserves the identity matrix: $K_1^* I K_1 + K_2^* I K_2 = I$.

• The matrix convex combination preserves positivity:

\[
K_1^* A_1 K_1 + K_2^* A_2 K_2 \succcurlyeq 0 \quad\text{for all positive-semidefinite $A_1$ and $A_2$.}
\]

• If the eigenvalues of $A_1$ and $A_2$ are contained in an interval $I$, then the eigenvalues of the
matrix convex combination (8.5.3) are also contained in $I$.

We will encounter a concrete example of a matrix convex combination later when we prove
Theorem 8.6.2.
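
For readers who want to experiment, here is one hypothetical way to generate a decomposition of the identity: take two random matrices $G_1, G_2$, set $S = G_1^* G_1 + G_2^* G_2$, and put $K_i = G_i S^{-1/2}$. This construction is our own assumption, not one used in the text. The sketch then checks that the matrix convex combination keeps its eigenvalues inside the interval that contains the spectra of $A_1$ and $A_2$.

```python
import numpy as np

def complex_gaussian(d, rng):
    """Random d x d complex Gaussian matrix (illustrative helper)."""
    return rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

def random_decomposition(d, rng):
    """Return K1, K2 with K1* K1 + K2* K2 = I (hypothetical construction)."""
    g1, g2 = complex_gaussian(d, rng), complex_gaussian(d, rng)
    s = g1.conj().T @ g1 + g2.conj().T @ g2      # positive definite almost surely
    lam, q = np.linalg.eigh(s)
    r = (q * lam ** -0.5) @ q.conj().T           # S^{-1/2}
    return g1 @ r, g2 @ r

def hermitian_with_spectrum_in(lo, hi, d, rng):
    """Random Hermitian matrix whose eigenvalues lie in [lo, hi]."""
    q, _ = np.linalg.qr(complex_gaussian(d, rng))
    return (q * rng.uniform(lo, hi, size=d)) @ q.conj().T

rng = np.random.default_rng(7)
d, lo, hi = 4, -1.0, 2.0
k1, k2 = random_decomposition(d, rng)
a1 = hermitian_with_spectrum_in(lo, hi, d, rng)
a2 = hermitian_with_spectrum_in(lo, hi, d, rng)

combo = k1.conj().T @ a1 @ k1 + k2.conj().T @ a2 @ k2
combo = (combo + combo.conj().T) / 2             # remove rounding asymmetry
eigs = np.linalg.eigvalsh(combo)

identity_ok = np.allclose(k1.conj().T @ k1 + k2.conj().T @ k2, np.eye(d))
interval_ok = bool(lo - 1e-10 <= eigs[0] and eigs[-1] <= hi + 1e-10)
print(identity_ok, interval_ok)
```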


8.5.2 Jensen’s Inequality for Matrix Convex Combinations 

Operator convexity is a self-improving property. Even though the definition of an operator convex
function only involves a scalar convex combination, it actually contains an inequality for
matrix convex combinations. This is the content of the operator Jensen inequality.

Theorem 8.5.2 (Operator Jensen Inequality). Let $f$ be an operator convex function on an interval
$I$ of the real line, and let $A_1$ and $A_2$ be Hermitian matrices with eigenvalues in $I$. Consider a
decomposition of the identity

\[
K_1^* K_1 + K_2^* K_2 = I. \tag{8.5.4}
\]

Then

\[
f(K_1^* A_1 K_1 + K_2^* A_2 K_2) \preccurlyeq K_1^* f(A_1) K_1 + K_2^* f(A_2) K_2.
\]

Proof. Let us introduce a block-diagonal matrix:

\[
A = \begin{bmatrix} A_1 & 0 \\ 0 & A_2 \end{bmatrix}
\quad\text{for which}\quad
f(A) = \begin{bmatrix} f(A_1) & 0 \\ 0 & f(A_2) \end{bmatrix}.
\]

Indeed, the matrix $A$ lies in the domain of $f$ because its eigenvalues fall in the interval $I$. We can
apply a standard matrix function to a block-diagonal matrix by applying the function to each
block.

There are two main ingredients in the argument. The first idea is to realize the matrix convex
combination of $A_1$ and $A_2$ by conjugating the block-diagonal matrix $A$ with an appropriate
unitary matrix. To that end, let us construct a unitary matrix

\[
Q = \begin{bmatrix} K_1 & L_1 \\ K_2 & L_2 \end{bmatrix}
\quad\text{where}\quad Q^* Q = I \text{ and } Q Q^* = I.
\]

To see why this is possible, note that the first block of columns is orthonormal:

\[
\begin{bmatrix} K_1 \\ K_2 \end{bmatrix}^* \begin{bmatrix} K_1 \\ K_2 \end{bmatrix}
= K_1^* K_1 + K_2^* K_2 = I.
\]

As a consequence, we can choose $L_1$ and $L_2$ to complete the unitary matrix $Q$. By direct computation,
we find that

\[
Q^* A Q = \begin{bmatrix} K_1^* A_1 K_1 + K_2^* A_2 K_2 & * \\ * & * \end{bmatrix}. \tag{8.5.5}
\]

We have omitted the precise values of the entries labeled $*$ because they do not play a role in our
argument.

The second idea is to restrict the block matrix in (8.5.5) to its diagonal. To perform this maneuver,
we express the diagonalizing operation as a scalar convex combination of two unitary
conjugations, which gives us access to the operator convexity of $f$. Let us see how this works.
Define the unitary matrix

\[
U = \begin{bmatrix} I & 0 \\ 0 & -I \end{bmatrix}.
\]

The key observation is that, for any block matrix,

\[
\frac{1}{2} \begin{bmatrix} T & B \\ B^* & M \end{bmatrix}
+ \frac{1}{2} U^* \begin{bmatrix} T & B \\ B^* & M \end{bmatrix} U
= \begin{bmatrix} T & 0 \\ 0 & M \end{bmatrix}. \tag{8.5.6}
\]

Another advantage of this construction is that we can easily apply a standard matrix function to
the block-diagonal matrix.

Together, these two ideas lead to a succinct proof of the operator Jensen inequality. Write
$[\cdot]_{11}$ for the operation that returns the (1,1) block of a block matrix. We may calculate that

\[
\begin{aligned}
f(K_1^* A_1 K_1 + K_2^* A_2 K_2) &= f\big( [Q^* A Q]_{11} \big) \\
&= f\Big( \Big[ \tfrac{1}{2} Q^* A Q + \tfrac{1}{2} U^* (Q^* A Q) U \Big]_{11} \Big) \\
&= \Big[ f\Big( \tfrac{1}{2} Q^* A Q + \tfrac{1}{2} (QU)^* A (QU) \Big) \Big]_{11} \\
&\preccurlyeq \Big[ \tfrac{1}{2} f(Q^* A Q) + \tfrac{1}{2} f\big( (QU)^* A (QU) \big) \Big]_{11}.
\end{aligned}
\]

The first identity depends on the representation (8.5.5) of the matrix convex combination as
the (1,1) block of $Q^* A Q$. The second line follows because the averaging operation presented
in (8.5.6) does not alter the (1,1) block of the matrix. In view of (8.5.6), we are looking at the
(1,1) block of the matrix obtained by applying $f$ to a block-diagonal matrix. This is equivalent to
applying the function $f$ inside the (1,1) block, which gives the third line. Last, the semidefinite
relation follows from the operator convexity of $f$ on the interval $I$.

We complete the argument by reversing the steps we have taken so far.

\[
\begin{aligned}
f(K_1^* A_1 K_1 + K_2^* A_2 K_2)
&\preccurlyeq \Big[ \tfrac{1}{2} Q^* f(A) Q + \tfrac{1}{2} U^* \big( Q^* f(A) Q \big) U \Big]_{11} \\
&= [Q^* f(A) Q]_{11} \\
&= K_1^* f(A_1) K_1 + K_2^* f(A_2) K_2.
\end{aligned}
\]

To obtain the first relation, recall that a standard matrix function commutes with unitary conjugation.
The second identity follows from the formula (8.5.6) because diagonalization preserves
the (1,1) block. Finally, we identify the (1,1) block of $Q^* f(A) Q$ just as we did in (8.5.5). This step
depends on the fact that the diagonal blocks of $f(A)$ are simply $f(A_1)$ and $f(A_2)$. □
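
The operator Jensen inequality can be spot-checked for two operator convex functions that appear in this chapter, $t \mapsto t^2$ and $t \mapsto -\log t$. The sketch below is a numerical illustration only. The helper names and the construction $K_i = G_i S^{-1/2}$ of the decomposition of the identity are our own assumptions.

```python
import numpy as np

def complex_gaussian(d, rng):
    return rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

def fn_herm(f, a):
    """Standard matrix function f(A), after symmetrizing away rounding error."""
    lam, q = np.linalg.eigh((a + a.conj().T) / 2)
    return (q * f(lam)) @ q.conj().T

def random_decomposition(d, rng):
    """Return K1, K2 with K1* K1 + K2* K2 = I (hypothetical construction)."""
    g1, g2 = complex_gaussian(d, rng), complex_gaussian(d, rng)
    r = fn_herm(lambda t: t ** -0.5, g1.conj().T @ g1 + g2.conj().T @ g2)
    return g1 @ r, g2 @ r

def random_pd(d, rng):
    g = complex_gaussian(d, rng)
    return g @ g.conj().T + 0.5 * np.eye(d)

def jensen_gap(f, a1, a2, k1, k2):
    """Smallest eigenvalue of K1* f(A1) K1 + K2* f(A2) K2 - f(K1* A1 K1 + K2* A2 K2)."""
    lhs = fn_herm(f, k1.conj().T @ a1 @ k1 + k2.conj().T @ a2 @ k2)
    rhs = k1.conj().T @ fn_herm(f, a1) @ k1 + k2.conj().T @ fn_herm(f, a2) @ k2
    gap = rhs - lhs
    return float(np.linalg.eigvalsh((gap + gap.conj().T) / 2)[0])

rng = np.random.default_rng(8)
square_gaps, neglog_gaps = [], []
for _ in range(100):
    k1, k2 = random_decomposition(3, rng)
    a1, a2 = random_pd(3, rng), random_pd(3, rng)
    square_gaps.append(jensen_gap(np.square, a1, a2, k1, k2))
    neglog_gaps.append(jensen_gap(lambda t: -np.log(t), a1, a2, k1, k2))

print(min(square_gaps) >= -1e-9, min(neglog_gaps) >= -1e-9)
```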


8.6 The Matrix Perspective Transformation 

To show that the vector relative entropy is convex, we represented it as the perspective of a convex
function. To demonstrate that the matrix relative entropy is convex, we are going to perform
a similar maneuver. This section develops an extension of the perspective transformation that 
applies to operator convex functions. Then we demonstrate that this matrix perspective has a 
strong convexity property with respect to the semidefinite order. 

8.6.1 The Matrix Perspective

In the scalar setting, the perspective transformation converts a convex function into a bivariate 
convex function. There is a related construction that applies to an operator convex function. 

Definition 8.6.1 (Matrix Perspective). Let $f : \mathbb{R}_{++} \to \mathbb{R}$ be an operator convex function, and let $A$
and $H$ be positive-definite matrices of the same size. Define the perspective map

\[
\Psi_f(A; H) = A^{1/2} \cdot f(A^{-1/2} H A^{-1/2}) \cdot A^{1/2}.
\]

The notation $A^{1/2}$ refers to the unique positive-definite square root of $A$, and $A^{-1/2}$ denotes the
inverse of this square root.

The Conjugation Rule (2.1.12) ensures that all the matrices involved remain positive definite, so
this definition makes sense. To see why the matrix perspective extends the scalar perspective,
notice that

\[
\Psi_f(A; H) = A \cdot f(H A^{-1}) \quad\text{when $A$ and $H$ commute.} \tag{8.6.1}
\]

This formula is valid because commuting matrices are simultaneously diagonalizable. We will
use the matrix perspective in a case where the matrices commute, but it is no harder to analyze
the perspective without this assumption.

8.6.2 The Matrix Perspective is Operator Convex 

The key result is that the matrix perspective is an operator convex map on a pair of positive- 
definite matrices. This theorem follows from the operator Jensen inequality in much the same 
way that Fact 8.2.4 follows from scalar convexity. 

Theorem 8.6.2 (Matrix Perspective is Operator Convex). Let $f : \mathbb{R}_{++} \to \mathbb{R}$ be an operator convex
function. Let $A_i$ and $H_i$ be positive-definite matrices of the same size. Then

\[
\Psi_f(\tau A_1 + \bar{\tau} A_2; \tau H_1 + \bar{\tau} H_2) \preccurlyeq \tau \cdot \Psi_f(A_1; H_1) + \bar{\tau} \cdot \Psi_f(A_2; H_2) \quad\text{for } \tau \in [0,1].
\]

Proof. Let $f$ be an operator convex function, and let $\Psi_f$ be its perspective transform. Fix pairs
$(A_1, H_1)$ and $(A_2, H_2)$ of positive-definite matrices, and choose an interpolation parameter
$\tau \in [0,1]$. Form the scalar convex combinations

\[
A = \tau A_1 + \bar{\tau} A_2 \quad\text{and}\quad H = \tau H_1 + \bar{\tau} H_2.
\]

Our goal is to bound the perspective $\Psi_f(A; H)$ as a scalar convex combination of its values
$\Psi_f(A_1; H_1)$ and $\Psi_f(A_2; H_2)$. The idea is to introduce matrix interpolation parameters:

\[
K_1 = \tau^{1/2} A_1^{1/2} A^{-1/2} \quad\text{and}\quad K_2 = \bar{\tau}^{1/2} A_2^{1/2} A^{-1/2}.
\]

Observe that these two matrices decompose the identity:

\[
K_1^* K_1 + K_2^* K_2 = \tau \cdot A^{-1/2} A_1 A^{-1/2} + \bar{\tau} \cdot A^{-1/2} A_2 A^{-1/2} = A^{-1/2} A A^{-1/2} = I.
\]

This construction allows us to express the perspective using a matrix convex combination, which
gives us access to the operator Jensen inequality.

We calculate that

\[
\begin{aligned}
\Psi_f(A; H) &= A^{1/2} \cdot f(A^{-1/2} H A^{-1/2}) \cdot A^{1/2} \\
&= A^{1/2} \cdot f\big( \tau \cdot A^{-1/2} H_1 A^{-1/2} + \bar{\tau} \cdot A^{-1/2} H_2 A^{-1/2} \big) \cdot A^{1/2} \\
&= A^{1/2} \cdot f\big( K_1^* A_1^{-1/2} H_1 A_1^{-1/2} K_1 + K_2^* A_2^{-1/2} H_2 A_2^{-1/2} K_2 \big) \cdot A^{1/2}.
\end{aligned}
\]

The first line is simply the definition of the matrix perspective. In the second line, we use the
definition of $H$ as a scalar convex combination. Third, we introduce the matrix interpolation
parameters through the expressions $\tau^{1/2} A^{-1/2} = A_1^{-1/2} K_1$ and $\bar{\tau}^{1/2} A^{-1/2} = A_2^{-1/2} K_2$ and their
conjugate transposes. To continue the calculation, we apply the operator Jensen inequality,
Theorem 8.5.2, to reach

\[
\begin{aligned}
\Psi_f(A; H) &\preccurlyeq A^{1/2} \cdot \big[ K_1^* \cdot f(A_1^{-1/2} H_1 A_1^{-1/2}) \cdot K_1 + K_2^* \cdot f(A_2^{-1/2} H_2 A_2^{-1/2}) \cdot K_2 \big] \cdot A^{1/2} \\
&= \tau \cdot A_1^{1/2} \cdot f(A_1^{-1/2} H_1 A_1^{-1/2}) \cdot A_1^{1/2} + \bar{\tau} \cdot A_2^{1/2} \cdot f(A_2^{-1/2} H_2 A_2^{-1/2}) \cdot A_2^{1/2} \\
&= \tau \cdot \Psi_f(A_1; H_1) + \bar{\tau} \cdot \Psi_f(A_2; H_2).
\end{aligned}
\]

We have also used the Conjugation Rule (2.1.12) to support the first relation. Finally, we recall
the definitions of $K_1$ and $K_2$, and we identify the two matrix perspectives. □
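
A direct numerical experiment may help build intuition for Theorem 8.6.2. The sketch below implements the matrix perspective from Definition 8.6.1 for the operator convex function $f(t) = t - 1 - \log t$ and records the smallest eigenvalue of the convexity gap over random inputs. The helper names, the dimension, and the tolerance are assumptions chosen for this illustration.

```python
import numpy as np

def random_pd(d, rng):
    """Random d x d Hermitian positive-definite matrix (illustrative helper)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return g @ g.conj().T + 0.5 * np.eye(d)

def fn_herm(f, a):
    """Standard matrix function f(A), after symmetrizing away rounding error."""
    lam, q = np.linalg.eigh((a + a.conj().T) / 2)
    return (q * f(lam)) @ q.conj().T

def perspective(f, a, h):
    """Matrix perspective A^{1/2} f(A^{-1/2} H A^{-1/2}) A^{1/2}."""
    root = fn_herm(np.sqrt, a)
    inv_root = fn_herm(lambda t: t ** -0.5, a)
    return root @ fn_herm(f, inv_root @ h @ inv_root) @ root

def f(t):
    return t - 1.0 - np.log(t)

rng = np.random.default_rng(9)
gaps = []
for _ in range(100):
    a1, h1, a2, h2 = (random_pd(3, rng) for _ in range(4))
    tau = rng.uniform()
    gap = (tau * perspective(f, a1, h1) + (1 - tau) * perspective(f, a2, h2)
           - perspective(f, tau * a1 + (1 - tau) * a2, tau * h1 + (1 - tau) * h2))
    gaps.append(float(np.linalg.eigvalsh((gap + gap.conj().T) / 2)[0]))

print(min(gaps) >= -1e-8)
```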

8.7 The Kronecker Product 

The matrix relative entropy is a function of two matrices. One of the difficulties of analyzing this 
type of function is that the two matrix arguments do not generally commute with each other. 
As a consequence, the behavior of the matrix relative entropy depends on the interactions between
the eigenvectors of the two matrices. To avoid this problem, we will build matrices that do
commute with each other, which simplifies our task considerably. 

8.7.1 The Kronecker Product 

Our approach is based on a fundamental object from linear algebra. We restrict our attention
to the simplest version here.

Definition 8.7.1 (Kronecker Product). Let $A$ and $H$ be Hermitian matrices with dimension $d \times d$.
The Kronecker product $A \otimes H$ is the $d^2 \times d^2$ Hermitian matrix

\[
A \otimes H = \begin{bmatrix}
a_{11} H & \cdots & a_{1d} H \\
\vdots & \ddots & \vdots \\
a_{d1} H & \cdots & a_{dd} H
\end{bmatrix}.
\]

At first sight, the definition of the Kronecker product may seem strange, but it has many delightful
properties. The rest of the section develops the basic facts about this construction.


8.7.2 Linearity Properties 

First of all, a Kronecker product with the zero matrix is always zero: 


\[
A \otimes 0 = 0 \otimes 0 = 0 \otimes H \quad\text{for all $A$ and $H$.}
\]

Next, the Kronecker product is homogeneous in each factor:

\[
(\alpha A) \otimes H = \alpha (A \otimes H) = A \otimes (\alpha H) \quad\text{for } \alpha \in \mathbb{R}.
\]

Furthermore, the Kronecker product is additive in each coordinate:

\[
(A_1 + A_2) \otimes H = A_1 \otimes H + A_2 \otimes H
\quad\text{and}\quad
A \otimes (H_1 + H_2) = A \otimes H_1 + A \otimes H_2.
\]

In other words, the Kronecker product is a bilinear operation.

8.7.3 Mixed Products 

The Kronecker product interacts beautifully with the usual product of matrices. By direct calculation,
we obtain a simple rule for mixed products:

\[
(A_1 \otimes H_1)(A_2 \otimes H_2) = (A_1 A_2) \otimes (H_1 H_2). \tag{8.7.1}
\]

Since $I \otimes I$ is the identity matrix, the identity (8.7.1) leads to a formula for the inverse of a
Kronecker product:

\[
(A \otimes H)^{-1} = (A^{-1}) \otimes (H^{-1}) \quad\text{when $A$ and $H$ are invertible.} \tag{8.7.2}
\]

Another important consequence of the rule (8.7.1) is the following commutativity relation:

\[
(A \otimes I)(I \otimes H) = (I \otimes H)(A \otimes I) \quad\text{for all Hermitian matrices $A$ and $H$.} \tag{8.7.3}
\]

This simple fact has great importance for us.
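
The three rules (8.7.1), (8.7.2), and (8.7.3) are easy to confirm with NumPy's `np.kron`, which follows the block layout of Definition 8.7.1. In the sketch below, the helper names and the use of shifted positive-definite matrices (to keep the inverses well conditioned) are illustrative assumptions.

```python
import numpy as np

def random_pd(d, rng):
    """Random d x d Hermitian positive-definite matrix (illustrative helper)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return g @ g.conj().T + np.eye(d)

rng = np.random.default_rng(10)
d = 3
a1, a2, h1, h2 = (random_pd(d, rng) for _ in range(4))
eye = np.eye(d)

# Mixed-product rule (8.7.1).
mixed_ok = np.allclose(np.kron(a1, h1) @ np.kron(a2, h2), np.kron(a1 @ a2, h1 @ h2))

# Inverse of a Kronecker product (8.7.2).
inverse_ok = np.allclose(np.linalg.inv(np.kron(a1, h1)),
                         np.kron(np.linalg.inv(a1), np.linalg.inv(h1)))

# Commutativity relation (8.7.3).
left, right = np.kron(a1, eye), np.kron(eye, h1)
commute_ok = np.allclose(left @ right, right @ left)

print(mixed_ok, inverse_ok, commute_ok)
```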

8.7.4 The Kronecker Product of Positive Matrices 

As we have noted, the Kronecker product of two Hermitian matrices is itself an Hermitian matrix. 
In fact, the Kronecker product preserves positivity as well. 

Fact 8.7.2 (Kronecker Product Preserves Positivity). Let $A$ and $H$ be positive-definite matrices.
Then $A \otimes H$ is positive definite.

Proof. To see why, observe that

\[
A \otimes H = (A^{1/2} \otimes H^{1/2})(A^{1/2} \otimes H^{1/2}).
\]

As usual, $A^{1/2}$ refers to the unique positive-definite square root of the positive-definite matrix
$A$. We have expressed $A \otimes H$ as the square of an Hermitian matrix, so it must be a
positive-semidefinite matrix. To see that it is actually positive definite, we simply apply the inversion
formula (8.7.2) to discover that $A \otimes H$ is invertible. □

8.7.5 The Logarithm of a Kronecker Product 

As we have discussed, the matrix logarithm plays a central role in our analysis. There is an elegant 
formula for the logarithm of a Kronecker product that will be valuable to us. 


Fact 8.7.3 (Logarithm of a Kronecker Product). Let $A$ and $H$ be positive-definite matrices. Then

\[
\log(A \otimes H) = (\log A) \otimes I + I \otimes (\log H).
\]

Proof. The argument is based on the fact that the matrix logarithm is the functional inverse of the
matrix exponential. Since the exponential of a sum of commuting matrices equals the product
of the exponentials, we have

\[
\exp(M \otimes I + I \otimes T) = \exp(M \otimes I) \cdot \exp(I \otimes T).
\]

This formula relies on the commutativity relation (8.7.3). Applying the power series representation
of the exponential, we determine that

\[
\exp(M \otimes I) = \sum_{q=0}^{\infty} \frac{1}{q!} (M \otimes I)^q = \sum_{q=0}^{\infty} \frac{1}{q!} (M^q) \otimes I = e^{M} \otimes I.
\]

The second identity depends on the rule (8.7.1) for mixed products, and the last identity follows
from the linearity of the Kronecker product. A similar calculation shows that $\exp(I \otimes T) = I \otimes e^{T}$.
In summary,

\[
\exp(M \otimes I + I \otimes T) = (e^{M} \otimes I)(I \otimes e^{T}) = e^{M} \otimes e^{T}.
\]

We have used the product rule (8.7.1) again. To complete the argument, simply choose $M = \log A$
and $T = \log H$ and take the logarithm of the last identity. □
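
Fact 8.7.3 can be confirmed numerically by computing both sides from eigenvalue decompositions. The helper names in this NumPy sketch are our own, and the random test matrices are an arbitrary illustrative choice.

```python
import numpy as np

def random_pd(d, rng):
    """Random d x d Hermitian positive-definite matrix (illustrative helper)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return g @ g.conj().T + 0.5 * np.eye(d)

def logm_pd(a):
    """Matrix logarithm of a positive-definite matrix via its eigenvalue decomposition."""
    lam, q = np.linalg.eigh((a + a.conj().T) / 2)
    return (q * np.log(lam)) @ q.conj().T

rng = np.random.default_rng(11)
d = 3
a, h = random_pd(d, rng), random_pd(d, rng)
eye = np.eye(d)

lhs = logm_pd(np.kron(a, h))
rhs = np.kron(logm_pd(a), eye) + np.kron(eye, logm_pd(h))
log_kron_ok = np.allclose(lhs, rhs)
print(log_kron_ok)
```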

8.7.6 A Linear Map 

Finally, we claim that there is a linear map $\varphi$ that extracts the trace of the matrix product from
the Kronecker product. Let $A$ and $H$ be $d \times d$ Hermitian matrices. Then we define

\[
\varphi(A \otimes H) = \operatorname{tr}(AH). \tag{8.7.4}
\]

The map $\varphi$ is linear because the Kronecker product $A \otimes H$ tabulates all the pairwise products
of the entries of $A$ and $H$, and $\operatorname{tr}(AH)$ is a sum of certain of these pairwise products. For our
purposes, the key fact is that the map $\varphi$ preserves the semidefinite order:

\[
\sum_i A_i \otimes H_i \succcurlyeq 0 \quad\text{implies}\quad \sum_i \varphi(A_i \otimes H_i) \geq 0. \tag{8.7.5}
\]

This formula is valid for all Hermitian matrices $A_i$ and $H_i$. To see why (8.7.5) holds, simply note
that the map can be represented as an inner product:

\[
\varphi(A \otimes H) = \iota^* (A \otimes H) \iota \quad\text{where}\quad \iota := \operatorname{vec}(I_d).
\]

The vec operation stacks the columns of a $d \times d$ matrix on top of each other, moving from left to
right, to form a column vector of length $d^2$.
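
The inner-product representation of the linear map is simple to test. The sketch below restricts attention to real symmetric matrices, an assumption made here to keep the illustration short; the function name `phi` and the test matrices are likewise our own.

```python
import numpy as np

def random_symmetric(d, rng):
    """Random d x d real symmetric matrix (illustrative helper)."""
    g = rng.standard_normal((d, d))
    return (g + g.T) / 2

def phi(x, d):
    """Evaluate iota* X iota with iota = vec(I_d)."""
    iota = np.eye(d).reshape(-1)   # stacking the columns of the identity
    return float(iota @ x @ iota)

rng = np.random.default_rng(12)
d = 4
pairs = [(random_symmetric(d, rng), random_symmetric(d, rng)) for _ in range(3)]

# phi(A kron H) agrees with tr(AH) for each pair.
trace_ok = all(np.isclose(phi(np.kron(a, h), d), np.trace(a @ h)) for a, h in pairs)

# Linearity: phi of a sum of Kronecker products is the sum of the traces.
total = sum(np.kron(a, h) for a, h in pairs)
linear_ok = bool(np.isclose(phi(total, d), sum(np.trace(a @ h) for a, h in pairs)))

# Order preservation: a positive-semidefinite input yields a nonnegative value.
g = rng.standard_normal((d * d, d * d))
order_ok = phi(g @ g.T, d) >= 0.0
print(trace_ok, linear_ok, order_ok)
```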

8.8 The Matrix Relative Entropy is Convex 

We are finally prepared to establish Theorem 8.1.4, which states that the matrix relative entropy 
is a convex function. This argument draws on almost all of the ideas we have developed over the 
course of this chapter. 
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Consider the function f(a ) = a - 1 - log a, defined on the positive real line. This function is 
operator convex because it is the sum of the affine function a >->■ a - 1 and the operator convex 
function a*-*— log a. The negative logarithm is operator convex because of Proposition 8.4.8. 

Let A and H be positive-definite matrices. Consider the matrix perspective ¥ j- evaluated at 
the commuting positive-definite matrices A <8> I and I ® H: 

'F / L4®I; I® IT] = (A® I)-/((I® H)(A®I) _1 ) = (A® I)-/(A -1 ® if). 

We have used the simplified definition (8.6.1) of the perspective for commuting matrices, and we 
have invoked the rules (8.7.1) and (8.7.2) for arithmetic with Kronecker products. Introducing the 
definition of the function /, we find that 

T/fA® I; I® If) = (A® I) • [A -1 ® H-I® I-log(A _1 ® H)] 

= I®H-A®I-(A®I)- [(log A -1 ) ® I + I® (logH)] 

= (A log A) ® I - A® (log H) -(A®I-I®ff) 

To reach the second line, we use more Kronecker product arithmetic, along with Fact 8.7.3, the 
law for calculating the logarithm of the Kronecker product. The last line depends on the property 
that log(A _1 ) = - log A. Applying the linear map </> from (8.7.4) to both sides, we reach 

(<po'P / )(A®I; I® H) = tr[AlogA- AlogH- (A - H)] = D(A;H). (8.8.1) 

We have represented the matrix relative entropy in terms of a matrix perspective. 

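Let $A_i$ and $H_i$ be positive-definite matrices for $i = 1, 2$, and fix a parameter $\tau \in [0, 1]$, writing $\bar{\tau} := 1 - \tau$. Theorem 8.6.2 tells
us that the matrix perspective is operator convex:

$$\begin{aligned}
\Psi_f\big( \tau \cdot (A_1 \otimes \mathbf{I}) + \bar{\tau} \cdot (A_2 \otimes \mathbf{I});\ \tau \cdot (\mathbf{I} \otimes H_1) + \bar{\tau} \cdot (\mathbf{I} \otimes H_2) \big) \\
\preccurlyeq \tau \cdot \Psi_f(A_1 \otimes \mathbf{I};\ \mathbf{I} \otimes H_1) + \bar{\tau} \cdot \Psi_f(A_2 \otimes \mathbf{I};\ \mathbf{I} \otimes H_2).
\end{aligned}$$

The inequality (8.7.5) states that the linear map $\varphi$ preserves the semidefinite order. Therefore,

$$\begin{aligned}
(\varphi \circ \Psi_f)\big( \tau \cdot (A_1 \otimes \mathbf{I}) + \bar{\tau} \cdot (A_2 \otimes \mathbf{I});\ \tau \cdot (\mathbf{I} \otimes H_1) + \bar{\tau} \cdot (\mathbf{I} \otimes H_2) \big) \\
\leq \tau \cdot (\varphi \circ \Psi_f)(A_1 \otimes \mathbf{I};\ \mathbf{I} \otimes H_1) + \bar{\tau} \cdot (\varphi \circ \Psi_f)(A_2 \otimes \mathbf{I};\ \mathbf{I} \otimes H_2).
\end{aligned}$$

Since the Kronecker product is bilinear, the arguments on the left-hand side equal $(\tau A_1 + \bar{\tau} A_2) \otimes \mathbf{I}$ and $\mathbf{I} \otimes (\tau H_1 + \bar{\tau} H_2)$. Introducing the formula (8.8.1), we conclude that

$$D(\tau A_1 + \bar{\tau} A_2;\ \tau H_1 + \bar{\tau} H_2) \leq \tau \cdot D(A_1; H_1) + \bar{\tau} \cdot D(A_2; H_2).$$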

The matrix relative entropy is convex. 
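
The convexity inequality can be spot-checked numerically. The Python sketch below is only an illustration under stated assumptions: it evaluates $D(A; H) = \operatorname{tr}[A \log A - A \log H - (A - H)]$ for $2 \times 2$ real symmetric positive-definite matrices, computing the matrix logarithm from the two spectral projectors of a $2 \times 2$ symmetric matrix. The function names and the four test matrices are hypothetical choices, and a finite check of this kind is no substitute for the proof.

```python
import math

def logm_sym2(M):
    """Logarithm of a 2x2 real symmetric positive-definite matrix."""
    (a, b), (_, c) = M
    mean = 0.5 * (a + c)
    gap = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    lo, hi = mean - gap, mean + gap          # the two eigenvalues
    if gap == 0.0:                           # M is a multiple of the identity
        return [[math.log(a), 0.0], [0.0, math.log(a)]]
    # log(M) = log(hi) * P_hi + log(lo) * P_lo, where
    # P_hi = (M - lo*I) / (hi - lo) and P_lo = (hi*I - M) / (hi - lo).
    w_hi = math.log(hi) / (hi - lo)
    w_lo = math.log(lo) / (hi - lo)
    off = (w_hi - w_lo) * b
    return [[w_hi * (a - lo) + w_lo * (hi - a), off],
            [off, w_hi * (c - lo) + w_lo * (hi - c)]]

def trace_prod(X, Y):
    """Compute tr(X Y) for 2x2 matrices."""
    return sum(X[i][k] * Y[k][i] for i in range(2) for k in range(2))

def rel_entropy(A, H):
    """D(A; H) = tr[A log A - A log H - (A - H)]."""
    log_a, log_h = logm_sym2(A), logm_sym2(H)
    return (trace_prod(A, log_a) - trace_prod(A, log_h)
            - (A[0][0] + A[1][1]) + (H[0][0] + H[1][1]))

def mix(t, X, Y):
    """Convex combination t*X + (1 - t)*Y."""
    return [[t * X[i][j] + (1.0 - t) * Y[i][j] for j in range(2)]
            for i in range(2)]

# Arbitrary positive-definite test matrices.
A1 = [[2.0, 0.5], [0.5, 1.0]]
H1 = [[1.0, 0.2], [0.2, 3.0]]
A2 = [[0.5, -0.1], [-0.1, 4.0]]
H2 = [[2.0, 1.0], [1.0, 2.0]]

# Right-hand side minus left-hand side of the convexity inequality.
gaps = []
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    lhs = rel_entropy(mix(t, A1, A2), mix(t, H1, H2))
    rhs = t * rel_entropy(A1, H1) + (1.0 - t) * rel_entropy(A2, H2)
    gaps.append(rhs - lhs)
```

Each entry of `gaps` should be nonnegative up to rounding error, in agreement with Theorem 8.1.4.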


8.9 Notes 

The material in this chapter is drawn from a variety of sources, ranging from textbooks to lecture 
notes to contemporary research articles. The best general sources include the books on matrix
analysis by Bhatia [Bha97, Bha07] and by Hiai & Petz [HP14]. We also recommend a set of
notes [Car10] by Eric Carlen. More specific references appear below.


8.9.1 Lieb’s Theorem

Theorem 8.1.1 is one of the major results in the important paper [Lie73] of Elliott Lieb on convex
trace functions. Lieb wrote this paper to resolve a conjecture of Wigner, Yanase, & Dyson
about the concavity properties of a certain measure of information in a quantum system. He
was also motivated by a conjecture that quantum mechanical entropy satisfies a strong subadditivity
property. The latter result states that our uncertainty about a partitioned quantum system
is controlled by the uncertainty about smaller parts of the system. See Carlen’s notes [Car10] for
a modern presentation of these ideas.

Lieb derived Theorem 8.1.1 as a corollary of another difficult concavity theorem that he developed
[Lie73, Thm. 1]. The most direct proof of Lieb’s Theorem is probably Epstein’s argument,
which is based on methods from complex analysis [Eps73]; see Ruskai’s papers [Rus02, Rus05]
for a condensed version of Epstein’s approach. The proof that appears in Section 8.1 is due to
the author of these notes [Tro12]; this technique depends on ideas developed by Carlen & Lieb
to prove some other convexity theorems [CL08, §5].

In fact, many deep convexity and concavity theorems for trace functions are equivalent to
one another, in the sense that the mutual implications follow from relatively easy arguments.
See [Lie73, §5] and [CL08, §5] for discussion of this point. 

8.9.2 The Matrix Relative Entropy 

Our definition of matrix relative entropy differs slightly from the usual definition in the literature 
on quantum statistical mechanics and quantum information theory because we have included 
an additional linear term. This alteration does not lead to substantive changes in the analysis. 

The fact that matrix relative entropy is nonnegative is a classical result attributed to Klein. 
See [Pet94, §2] or [Car10, §2.3].

Lindblad [Lin73] is credited with the result that matrix relative entropy is convex, as stated 
in Theorem 8.1.4. Lindblad derived this theorem as a corollary of Lieb’s results from [Lie73].
Bhatia [Bha97, Chap. IX] gives two alternative proofs, one due to Connes & Størmer [CS75] and
another due to Petz [Pet86]. There is also a remarkable proof due to Ando [And79, Thm. 7].

Our approach to Theorem 8.1.4 is adapted directly from a recent paper of Effros [Eff09]. Nevertheless,
many of the ideas date back to the works cited in the last paragraph.

8.9.3 The Relative Entropy for Vectors 

The treatment of the relative entropy for vectors in Section 8.2 is based on two classical methods 
for constructing divergences. To show that the relative entropy is nonnegative, we represent it as 
a Bregman divergence [Bre67]. To show that the relative entropy is convex, we represent it as an 
$f$-divergence [AS66, Csi67]. Let us say a few more words about these constructions.

Suppose that $f$ is a differentiable convex function on $\mathbb{R}^d$. Bregman considered divergences
of the form

$$B_f(a; h) := f(a) - \big[ f(h) + \langle \nabla f(h),\ a - h \rangle \big].$$

Since $f$ is convex, the Bregman divergence $B_f$ is always nonnegative. In the vector setting, there
are two main examples of Bregman divergences. The function $f(a) = \tfrac{1}{2} \|a\|_2^2$ leads to the squared
Euclidean distance, and the function $f(a) = \sum_i (a_i \log a_i - a_i)$ leads to the vector relative entropy.
Bregman divergences have many geometric properties in common with these two functions. For
an introduction to Bregman divergences for matrices, see [DT07].
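
The two examples can be worked out directly from the definition. The Python sketch below is an illustration, with hypothetical function names and arbitrary test vectors; it assumes the normalization $f(a) = \tfrac{1}{2}\|a\|_2^2$, for which the Bregman divergence comes out as one half of the squared Euclidean distance, and it assumes strictly positive entries for the entropy example.

```python
import math

def bregman(f, grad_f, a, h):
    """B_f(a; h) = f(a) - [f(h) + <grad f(h), a - h>] for vectors given as lists."""
    inner = sum(g * (x - y) for g, x, y in zip(grad_f(h), a, h))
    return f(a) - (f(h) + inner)

def half_sq_norm(a):
    """f(a) = (1/2) * ||a||_2^2."""
    return 0.5 * sum(x * x for x in a)

def half_sq_norm_grad(a):
    """Gradient of (1/2) * ||a||_2^2, which is a itself."""
    return list(a)

def neg_entropy(a):
    """f(a) = sum_i (a_i log a_i - a_i); requires positive entries."""
    return sum(x * math.log(x) - x for x in a)

def neg_entropy_grad(a):
    """Gradient of neg_entropy, with entries log a_i."""
    return [math.log(x) for x in a]

# Arbitrary positive test vectors.
a = [0.5, 2.0, 1.5]
h = [1.0, 1.0, 3.0]

euclid = bregman(half_sq_norm, half_sq_norm_grad, a, h)
entropy = bregman(neg_entropy, neg_entropy_grad, a, h)

# Closed forms to compare against.
half_sq_dist = 0.5 * sum((x - y) ** 2 for x, y in zip(a, h))
rel_entropy = sum(x * math.log(x / y) - (x - y) for x, y in zip(a, h))
```

Both divergences are nonnegative, and each matches its closed form up to rounding error.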

Suppose that $f : \mathbb{R}_{++} \to \mathbb{R}$ is a convex function. Ali & Silvey [AS66] and Csiszár [Csi67] considered
divergences of the form

$$C_f(a; h) := \sum\nolimits_i a_i \cdot f(h_i / a_i).$$

We recognize this expression as a perspective transformation, so the $f$-divergence $C_f$ is always
convex. The main example is based on the Shannon entropy $f(a) = a \log a$, which leads to
a cousin of the vector relative entropy. The paper [RW11] contains a recent discussion of $f$-divergences
and their applications in machine learning. Petz has studied functions related to
$f$-divergences in the matrix setting [Pet86, Pet10].
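
A small Python sketch may help to make the construction concrete. It assumes the form $C_f(a; h) = \sum_i a_i f(h_i / a_i)$ for positive vectors and evaluates it for two convex functions: $f(t) = t - 1 - \log t$, which reproduces the vector relative entropy, and $f(t) = t \log t$, which gives $\sum_i h_i \log(h_i / a_i)$. It also checks joint convexity at one mixing weight. All names, test vectors, and the weight are illustrative assumptions.

```python
import math

def f_divergence(f, a, h):
    """C_f(a; h) = sum_i a_i * f(h_i / a_i) for positive vectors given as lists."""
    return sum(x * f(y / x) for x, y in zip(a, h))

def f_entropy_like(t):
    """f(t) = t - 1 - log t."""
    return t - 1.0 - math.log(t)

def f_shannon(t):
    """f(t) = t log t."""
    return t * math.log(t)

def mix(tau, u, v):
    """Convex combination tau*u + (1 - tau)*v of two vectors."""
    return [tau * x + (1.0 - tau) * y for x, y in zip(u, v)]

# Arbitrary positive test vectors.
a1, h1 = [0.5, 2.0, 1.5], [1.0, 1.0, 3.0]
a2, h2 = [2.0, 0.25, 1.0], [0.5, 4.0, 1.0]

# Closed forms for the two examples, evaluated at (a1, h1).
rel_entropy = sum(x * math.log(x / y) - (x - y) for x, y in zip(a1, h1))
cousin = sum(y * math.log(y / x) for x, y in zip(a1, h1))

# Joint convexity: the gap below should be nonnegative for each f.
tau = 0.3
convexity_gaps = [
    tau * f_divergence(f, a1, h1) + (1.0 - tau) * f_divergence(f, a2, h2)
    - f_divergence(f, mix(tau, a1, a2), mix(tau, h1, h2))
    for f in (f_entropy_like, f_shannon)
]
```

The second example differs from the relative entropy in the order of its arguments and in the missing linear term, which is one way to read the phrase "a cousin of the vector relative entropy."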

8.9.4 Elementary Trace Inequalities 

The material in Section 8.3 on trace functions is based on classical results in quantum statistical 
mechanics. We have drawn the arguments in this section from Petz’s survey [Pet94, Sec. 2] and
Carlen’s lecture notes [Car10, Sec. 2.2].


8.9.5 Operator Monotone & Operator Convex Functions 


The theory of operator monotone functions was initiated by Löwner [Löw34]. He developed a
characterization of an operator monotone function in terms of divided differences. For a function
$f$, the first divided difference is the quantity

$$f[a, h] := \begin{cases} \dfrac{f(a) - f(h)}{a - h}, & h \neq a, \\[1ex] f'(a), & h = a. \end{cases}$$

Löwner proved that $f$ is operator monotone on an interval $I$ if and only if we have the semidefinite
relation

$$\begin{bmatrix} f[a_1, a_1] & \dots & f[a_1, a_d] \\ \vdots & \ddots & \vdots \\ f[a_d, a_1] & \dots & f[a_d, a_d] \end{bmatrix} \succcurlyeq \mathbf{0} \quad\text{for all } \{a_i\} \subset I \text{ and all } d \in \mathbb{N}.$$

This result is analogous to the fact that a smooth, monotone scalar function has a nonnegative
derivative. Löwner also established a connection between operator monotone functions and
Pick functions from the theory of complex variables. A few years later, Kraus introduced the
concept of an operator convex function in [Kra36], and he developed some results that parallel
Löwner’s theory for operator monotone functions.
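
To see the criterion in action, the following Python sketch forms the $2 \times 2$ matrix of divided differences at the points $1$ and $4$ for two functions. For the square root, which is operator monotone on $(0, \infty)$, the matrix is positive semidefinite. For $t \mapsto t^2$, which is monotone on $(0, \infty)$ as a scalar function, the matrix has a negative determinant, so the function fails the criterion. The helper names and the choice of points are illustrative assumptions, and a single positive-semidefinite instance does not by itself establish operator monotonicity.

```python
import math

def divided_difference(f, df, a, h):
    """First divided difference f[a, h]; uses the derivative df when a == h."""
    return df(a) if a == h else (f(a) - f(h)) / (a - h)

def loewner_matrix(f, df, points):
    """Matrix with entries f[a_i, a_j] for the given list of points."""
    return [[divided_difference(f, df, a, h) for h in points] for a in points]

def is_psd_2x2(L, tol=1e-12):
    """A symmetric 2x2 matrix is psd iff its diagonal and determinant are nonnegative."""
    det = L[0][0] * L[1][1] - L[0][1] * L[1][0]
    return L[0][0] >= -tol and L[1][1] >= -tol and det >= -tol

def sqrt_derivative(t):
    """Derivative of the square root."""
    return 0.5 / math.sqrt(t)

def square(t):
    """The scalar function t -> t^2."""
    return t * t

def square_derivative(t):
    """Derivative of t -> t^2."""
    return 2.0 * t

points = [1.0, 4.0]
sqrt_L = loewner_matrix(math.sqrt, sqrt_derivative, points)
square_L = loewner_matrix(square, square_derivative, points)

sqrt_ok = is_psd_2x2(sqrt_L)      # criterion holds at these points
square_ok = is_psd_2x2(square_L)  # criterion fails at these points
```

Here `square_L` has diagonal entries $2$ and $8$ and off-diagonal entries $5$, so its determinant is $16 - 25 < 0$.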

Somewhat later, Bendat & Sherman [BS55] developed characterizations of operator monotone
and operator convex functions based on integral formulas. For example, $f$ is an operator
monotone function on $(0, \infty)$ if and only if it can be written in the form

$$f(t) = \alpha + \beta t + \int_0^\infty \frac{ut}{u + t} \, \mathrm{d}\rho(u) \quad\text{where}\quad \beta \geq 0 \quad\text{and}\quad \int_0^\infty \frac{u}{1 + u} \, \mathrm{d}\rho(u) < \infty.$$

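Similarly, $f$ is an operator convex function on $[0, \infty)$ if and only if it can be written in the form

$$f(t) = \alpha + \beta t + \gamma t^2 + \int_0^\infty \frac{u t^2}{u + t} \, \mathrm{d}\rho(u) \quad\text{where}\quad \gamma \geq 0 \quad\text{and}\quad \int_0^\infty \mathrm{d}\rho(u) < \infty.$$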

In both cases, $\mathrm{d}\rho$ is a nonnegative measure. The integral representation of the logarithm in
Proposition 8.4.1 is closely related to these formulas.
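
As an illustration of the first formula, consider one standard special case that is not taken from the text: with $\alpha = \beta = 0$ and $\mathrm{d}\rho(u) = \pi^{-1} u^{-3/2} \, \mathrm{d}u$, the integral $\int_0^\infty \frac{ut}{u+t} \, \mathrm{d}\rho(u)$ evaluates to $\sqrt{t}$. The Python sketch below checks this numerically. The function name, the substitution $u = \tan^2\theta$, and the midpoint rule are our own choices for the sake of a short, dependency-free example.

```python
import math

def sqrt_via_integral(t, n=2000):
    """Approximate (1/pi) * int_0^inf [u*t/(u+t)] * u**(-3/2) du by the midpoint rule.

    The substitution u = tan(theta)**2 turns the integral into
    (1/pi) * int_0^{pi/2} 2*t / (sin(theta)**2 + t*cos(theta)**2) d(theta),
    whose integrand is smooth on the whole interval.
    """
    step = (math.pi / 2.0) / n
    total = 0.0
    for k in range(n):
        theta = (k + 0.5) * step
        total += 2.0 * t / (math.sin(theta) ** 2 + t * math.cos(theta) ** 2)
    return total * step / math.pi

# Compare the quadrature value with the square root at a few points.
samples = [0.25, 1.0, 4.0, 9.0]
errors = [abs(sqrt_via_integral(t) - math.sqrt(t)) for t in samples]
```

The entries of `errors` should be negligible, which is consistent with the square root being operator monotone on $(0, \infty)$.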

We have taken the proof that the matrix inverse is monotone from Bhatia’s book [Bha97,
Prop. V.1.6]. The proof that the matrix inverse is convex appears in Ando’s paper [And79]. Our 
treatment of the matrix logarithm was motivated by a conversation with Eric Carlen at an IPAM 
workshop at Lake Arrowhead in December 2010. 

For more information about operator monotonicity and operator convexity, we recommend 
Bhatia’s books [Bha97, Bha07], Carlen’s lecture notes [Car10], and the book of Hiai & Petz [HP14].

8.9.6 The Operator Jensen Inequality 

The paper [HP82] of Hansen & Pedersen contains another treatment of operator monotone and 
operator convex functions. The highlight of this work is a version of the operator Jensen inequality.
Theorem 8.5.2 is a refinement of this result that was established by the same authors
two decades later [HP03]. Our proof of the operator Jensen inequality is drawn from Petz’s
book [Pet11, Thm. 8.4]; see also Carlen’s lecture notes [Car10, Thm. 4.20].

8.9.7 The Matrix Perspective & the Kronecker Product 

We have been unable to identify the precise source of the idea that a bivariate matrix function can 
be represented in terms of a matrix perspective. Two important results in this direction appear 
in Ando’s paper [And79, Thms. 6 and 7]. 

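$$f \text{ positive and operator concave on } (0, \infty) \quad\text{implies}\quad (A, H) \mapsto (A \otimes \mathbf{I}) \cdot f(A^{-1} \otimes H) \text{ is operator concave}$$

on pairs of positive-definite matrices. Similarly,

$$f \text{ operator monotone on } (0, \infty) \quad\text{implies}\quad (A, H) \mapsto (A \otimes \mathbf{I}) \cdot f(A \otimes H^{-1}) \text{ is operator convex}$$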

on pairs of positive-definite matrices. Ando proves that the matrix relative entropy is convex by 
applying the latter result to the matrix logarithm. We believe that Ando was the first author to 
appreciate the value of framing results of this type in terms of the Kronecker product, and we 
have followed his strategy here. On the other hand, Ando’s analysis is different in spirit because 
he relies on integral representations of operator monotone and convex functions. 

In a subsequent paper [KA80], Kubo & Ando constructed operator means using a related approach.
They show that

$$f \text{ positive and operator monotone on } (0, \infty) \quad\text{implies}\quad (A, H) \mapsto A^{1/2} \cdot f(A^{-1/2} H A^{-1/2}) \cdot A^{1/2} \text{ is operator concave}$$

on pairs of positive-definite matrices. Kubo & Ando point out that particular cases of this construction
appear in the work of Pusz & Woronowicz [PW75]. This is the earliest citation where we
have seen the matrix perspective in black and white.

A few years later, Petz introduced a class of quasi-entropies for matrices [Pet86]. These functions
also involve a perspective-like construction, and Petz was clearly influenced by Csiszár’s
work on $f$-divergences. See [Pet10] for a contemporary treatment.

The presentation in these notes is based on a recent paper [Eff09] of Effros. He showed that 
convexity properties of the matrix perspective follow from the operator Jensen inequality, and he 
derived the convexity of the matrix relative entropy as a consequence. Our analysis of the matrix
perspective in Theorem 8.6.2 is drawn from a subsequent paper [ENG11], which removes some 
commutativity assumptions from Effros’s argument. 

The proof in §8.8 that the matrix relative entropy is convex, Theorem 8.1.4, recasts Effros’s 
argument [Eff09, Cor. 2.2] in the language of Kronecker products. In his paper, Effros works with 
left- and right-multiplication operators. To appreciate the connection, simply note the identities 

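$$(A \otimes \mathbf{I}) \operatorname{vec}(M) = \operatorname{vec}(MA) \quad\text{and}\quad (\mathbf{I} \otimes H) \operatorname{vec}(M) = \operatorname{vec}(HM).$$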

In other words, the matrix A ® I can be interpreted as right-multiplication by A, while the matrix 
I ® H can be interpreted as left-multiplication by H. (The change in sense is an unfortunate 
consequence of the definition of the Kronecker product.) 
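
These identities are easy to confirm on a small example. The Python sketch below uses integer matrices so that the comparison is exact, with a column-stacking `vec` and hypothetical helper names. It takes $A$ and $H$ to be real symmetric; for a general real matrix $A$, the same computation yields $\operatorname{vec}(M A^{\mathsf{T}})$, which coincides with $\operatorname{vec}(MA)$ in the symmetric case.

```python
# Illustrative helpers (hypothetical names); matrices are lists of rows.

def kron(A, H):
    """Kronecker product of two square matrices."""
    n, m = len(A), len(H)
    return [[A[i // m][j // m] * H[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def vec(M):
    """Stack the columns of M, left to right, into one flat list."""
    rows, cols = len(M), len(M[0])
    return [M[i][j] for j in range(cols) for i in range(rows)]

def matmul(X, Y):
    """Ordinary matrix product X Y."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(K, v):
    """Matrix-vector product K v."""
    return [sum(K[i][j] * v[j] for j in range(len(v))) for i in range(len(K))]

identity = [[1, 0], [0, 1]]
A = [[2, 1], [1, 3]]     # real symmetric
H = [[1, -1], [-1, 4]]   # real symmetric
M = [[1, 2], [3, 4]]     # deliberately not symmetric

right_mult = matvec(kron(A, identity), vec(M))  # compare with vec(M A)
left_mult = matvec(kron(identity, H), vec(M))   # compare with vec(H M)
```

With these inputs, `right_mult` equals `vec(matmul(M, A))` and `left_mult` equals `vec(matmul(H, M))`.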


Matrix Concentration: Resources 


This annotated bibliography describes some papers that involve matrix concentration inequalities.
Right now, this presentation is heavily skewed toward theoretical results, rather than applications
of matrix concentration.

Exponential Matrix Concentration Inequalities 

We begin with papers that contain the most current results on matrix concentration. 

• [Tro11c]. These lecture notes are based heavily on the research described in this paper.
This work identifies Lieb’s Theorem [Lie73, Thm. 6] as the key result that animates exponential
moment bounds for random matrices. Using this technique, the paper develops
the bounds for matrix Gaussian and Rademacher series, the matrix Chernoff inequalities,
and several versions of the matrix Bernstein inequality. In addition, it contains a matrix
Hoeffding inequality (for sums of bounded random matrices), a matrix Azuma inequality
(for matrix martingales with bounded differences), and a matrix bounded difference
inequality (for matrix-valued functions of independent random variables).

• [Tro12]. This note describes a simple proof of Lieb’s Theorem that is based on the joint convexity
of quantum relative entropy. This reduction, however, still involves a deep convexity
theorem. Chapter 8 contains an explication of this paper.

• [Oli10a]. Oliveira’s paper uses an ingenious argument, based on the Golden-Thompson
inequality (3.3.3), to establish a matrix version of Freedman’s inequality. This result is, 
roughly, a martingale version of Bernstein’s inequality. This approach has the advantage 
that it extends to the fully noncommutative setting [JZ12]. Oliveira applies his results to 
study some problems in random graph theory. 

• [Tro11a]. This paper shows that Lieb’s Theorem leads to a Freedman-type inequality for
matrix-valued martingales. The associated technical report [Tro11b] describes additional
results for matrix-valued martingales. 

• [GT14]. This article explains how to use the Lieb-Seiringer Theorem [LS05] to develop tail
bounds for the interior eigenvalues of a sum of independent random matrices. It contains
a Chernoff-type bound for a sum of positive-semidefinite matrices, as well as several
Bernstein-type bounds for sums of bounded random matrices.

• [MJC+14]. This paper contains a strikingly different method for establishing matrix concentration
inequalities. The argument is based on work of Sourav Chatterjee [Cha07] that
shows how Stein’s method of exchangeable pairs [Ste72] leads to probability inequalities.
This technique has two main advantages. First, it gives results for random matrices that are
based on dependent random variables. As a special case, the results apply to sums of independent
random matrices. Second, it delivers both exponential moment bounds and polynomial
moment bounds for random matrices. Indeed, the paper describes a Bernstein-type
exponential inequality and also a Rosenthal-type polynomial moment bound. Furthermore,
this work contains what is arguably the simplest known proof of the noncommutative
Khintchine inequality.

• [PMT14]. This paper improves on the work in [MJC+14] by extending an argument, based
on Markov chains, that was developed in Chatterjee’s thesis [Cha05]. This analysis leads
to satisfactory matrix analogs of scalar concentration inequalities based on logarithmic
Sobolev inequalities. In particular, it is possible to develop a matrix version of the exponential
Efron-Stein inequality in this fashion.

• [CGT12a, CGT12b]. The primary focus of this paper is to analyze a specific type of procedure
for covariance estimation. The appendix contains a new matrix moment inequality
that is, roughly, the polynomial moment bound associated with the matrix Bernstein inequality.

• [Kol11]. These lecture notes use matrix concentration inequalities as a tool to study some
estimation problems in statistics. They also contain some matrix Bernstein inequalities for 
unbounded random matrices. 

• [GN]. Gross and Nesme show how to extend Hoeffding’s method for analyzing sampling
without replacement to the matrix setting. This result can be combined with a variety of 
matrix concentration inequalities. 

• [Tro11d]. This paper combines the matrix Chernoff inequality, Theorem 5.1.1, with the
argument from [GN] to obtain a matrix Chernoff bound for a sum of random positive-semidefinite
matrices sampled without replacement from a fixed collection. The result is
applied to a random matrix that plays a role in numerical linear algebra. 

• [CT14]. This paper establishes logarithmic Sobolev inequalities for random matrices, and
it derives some matrix concentration inequalities as a consequence. The methods in the
paper have applications in quantum information theory, although the matrix concentration
bounds are inferior to related results derived using Stein’s method.


Bounds with Intrinsic Dimension Parameters 

The following works contain matrix concentration bounds that depend on a dimension parameter
that may be smaller than the ambient dimension of the matrix.

• [Oli10b]. Oliveira shows how to develop a version of Rudelson’s inequality [Rud99] using
a variant of the argument of Ahlswede & Winter from [AW02]. Oliveira’s paper is notable
because the dimensional factor is controlled by the maximum rank of the random matrix, 
rather than the ambient dimension. 

• [MZ11]. This work contains a matrix Chernoff bound for a sum of independent positive-semidefinite
random matrices where the dimensional dependence is controlled by the
maximum rank of the random matrix. The approach is, essentially, the same as the argument
in Rudelson’s paper [Rud99]. The paper applies these results to study randomized
matrix multiplication algorithms.

• [HKZ12]. This paper describes a method for proving matrix concentration inequalities
where the ambient dimension is replaced by the intrinsic dimension of the matrix variance.
The argument is based on an adaptation of the proof in [Tro11a]. The authors give
several examples in statistics and machine learning. 

• [Min11]. This work presents a more refined technique for obtaining matrix concentration
inequalities that depend on the intrinsic dimension, rather than the ambient dimension. 
This paper motivated the results in Chapter 7. 

The Method of Ahlswede & Winter 

Next, we list some papers that use the ideas from the work [AW02] of Ahlswede & Winter to obtain
matrix concentration inequalities. In general, these results have suboptimal parameters, but
they played an important role in the development of this field.

• [AW02]. The original paper of Ahlswede & Winter describes the matrix Laplace transform 
method, along with a number of other foundational results. They show how to use the 
Golden-Thompson inequality to bound the trace of the matrix mgf, and they use this technique
to prove a matrix Chernoff inequality for sums of independent and identically distributed
random variables. Their main application concerns quantum information theory.

• [CM08]. Christofides and Markström develop a Hoeffding-type inequality for sums of
bounded random matrices using the approach of Ahlswede & Winter. They apply this result
to study random graphs.

• [Gro11]. Gross presents a matrix Bernstein inequality based on the method of Ahlswede &
Winter, and he uses it to study algorithms for matrix completion. 

• [Rec11]. Recht describes a different version of the matrix Bernstein inequality, which also
follows from the technique of Ahlswede & Winter. His paper also concerns algorithms for 
matrix completion. 

Noncommutative Moment Inequalities 

We conclude with an overview of some major works on bounds for the polynomial moments 
of a noncommutative martingale. Sums of independent random matrices provide one concrete 
example where these results apply. The results in this literature are as strong as, or stronger than,
the exponential moment inequalities that we have described in these notes. Unfortunately, the
proofs are typically quite abstract and difficult, and they do not usually lead to explicit constants. 
Recently there has been some cross-fertilization between noncommutative probability and the 
field of matrix concentration inequalities. 

Note that “noncommutative” is not synonymous with “matrix” in that there are noncommutative
von Neumann algebras much stranger than the familiar algebra of finite-dimensional
matrices equipped with the operator norm.



• [TJ74]. This classic paper gives a bound for the expected trace of an even power of a matrix
Rademacher series. These results are important, but they do not give the optimal bounds. 

• [LP86]. This paper gives the first noncommutative Khintchine inequality, a bound for the
expected trace of an even power of a matrix Rademacher series that depends on the matrix 
variance. 

• [LPP91]. This work establishes dual versions of the noncommutative Khintchine inequality.

• [Buc01, Buc05]. These papers prove optimal noncommutative Khintchine inequalities in
more general settings, and they obtain sharp constants. 

• [JX03, JX08]. These papers establish noncommutative versions of the Burkholder-Davis-Gundy
inequality for martingales. They also give an application of these results to random
matrix theory. 

• [JX05]. This paper contains an overview of noncommutative moment results, along with
information about the optimal rate of growth in the constants. 

• [JZ13]. This paper describes a fully noncommutative version of the Bennett inequality. The
proof is based on the method of Ahlswede & Winter [AW02].

• [JZ12]. This work shows how to use Oliveira’s argument [Oli10a] to obtain some results for
fully noncommutative martingales. 

• [MJC+14]. This work, described above, includes a section on matrix moment inequalities.
This paper contains what are probably the simplest available proofs of these results. 

• [CGT12a]. The appendix of this paper contains a polynomial inequality for sums of independent
random matrices.